- 6 December 94
- Version 1.30 of pccts
-
- =============================================================================
- This help file is provided without warranty or guarantee of any kind.
- =============================================================================
-
- This help file is available via anonymous FTP at:
-
- Node: everest.ee.umn.edu [128.101.144.112]
- File: /pub/pccts/1.30/NOTES.newbie
-
- Mirror sites for pccts:
-
- Europe:
-
- Node: ftp.th-darmstadt.de [130.83.55.75]
- Directory: pub/programming/languages/compiler-compiler/pccts
-
- According to the FAQ this is updated daily.
-
- Also:
-
- Node: ftp.uu.net
- Directory: languages/tools/pccts
-
- Pre-built binaries for pccts are available in:
-
- Node: everest.ee.umn.edu [128.101.144.112]
- Directory: /pub/pccts/binaries/PC
- Directory: /pub/pccts/binaries/SGI
- Directory: /pub/pccts/binaries/Ultrix4.3
- etc.
-
- Note: There is no guarantee that these binaries will be
- up-to-date. They are contributed by users of these machines
- rather than the pccts developers.
-
- Contributed Files are in:
-
- Node: everest.ee.umn.edu [128.101.144.112]
- Directory: /pub/pccts/contrib
-
- Mail corrections or additions to moog@polhode.com
-
- The format of NOTES.newbie has been changed to make it easier to look for
- changes from one version to the next using difference programs.
-
- Page 2
- ===============================================================================
- Miscellaneous
- -------------------------------------------------------------------------------
- (Item 1)
- ##. NEVER choose rule names, #token names, #lexclass names, #errclass
- names, etc. which coincide with the reserved words of your C or C++
- compiler. Be alert to name collisions with your favorite libraries
- and #include files. One can only imagine the results of definitions like:
-
- #token FILE "file"
-
- const: "[0-9]*"
- (Item 2)
- ##. Tokens begin with uppercase characters. Rules begin with lowercase
- characters.
- (Item 3)
- ##. When passing the name of the start rule to the ANTLR macro don't
- forget to code the trailing function arguments:
-
- /* Not using ASTs */ ANTLR (grammar(),stdin);
- /* Using ASTs */ ANTLR (grammar(&ASTroot),stdin);
-
- /* *** Wrong *** */ ANTLR (grammar,stdin);
- (Item 4)
- ##. When you see a syntax error message that has quotation marks on
- separate lines:
-
- line 1: syntax error at "
- " missing ID
-
- that probably means that the offending element contains a newline.
- (Item 5)
- ##. Even if your C compiler does not support C++ style comments,
- you can use them in the *non-action* portion of the ANTLR source code.
- Inside an action (i.e. <<...>> ) you have to obey the comment
- conventions of your compiler.
- (Item 6)
- ##. To place the C right shift operator (">>") inside an Antlr action
- ("<<...>>") precede it with a backslash: "\>>". If you forget to do
- this you'll probably get the error message:
-
- warning: Missing <<; found dangling >>
-
- No special action is required for the shift left operator.
-
- This doesn't work with #lexaction or #header because the ">>" will be
- passed on to DLG which has exactly the same problem as Antlr. The
- only workaround I found for these special cases was to place the following
- in an #include file "shiftr.h":
-
- #define SHIFTR >>
-
- where it is never seen by either Antlr or DLG. Then I placed a #include
- "shiftr.h" in the #lexaction.
- Page 3
-
- (Item 7)
- ##. The C grammar distributed with pccts in pccts/lang/C has some
- shortcomings. It was written quite a while ago and has not been updated.
- It was written as an exercise, not as an end in itself.
-
- The "proto" program does not invoke a C pre-processor. If your code
- needs the C pre-processor you must invoke it separately. On my system
- one can use "cc -E ..." or "cc -P ..." to direct the output of the C
- pre-processor to the file specified by -o.
-
- The C grammar does not know about #pragma which appears in the #include
- files of some systems.
-
- There are some contributed versions of C grammars on node everest
- in /pub/pccts/contrib. They are "pure" grammars and have no action
- routines.
- (Item 8)
- ##. To place main() in a ".c" file rather than a grammar file (".g")
- place:
-
- #include "stdpccts.h"
-
- before invoking the ANTLR macro. Contributed by N.F. Ross.
- (Item 9)
- ##. ANTLR counts a line which is continued across a newline using
- the backslash convention as a single line. For example:
-
- #header <<
- #define abcd alpha\
- beta\
- gamma\
- delta
- >>
-
- This will cause line numbers in ANTLR error messages to be off by 3 compared
- to most text editors.
- (Item 10)
- ##. The Purdue Computer Science Department maintains a WWW directory
- which includes a pccts page:
-
- URL http://tempest.ecn.purdue.edu:8001/
- (Item 11)
- ##. In the discussions below one sometimes refers to "k=1" or "k>1". The
- value of k is the number of tokens of lookahead. However it is not
- necessarily the same as the value of the switch "-k" on Antlr's command
- line. The number of tokens of lookahead maintained by Antlr/DLG is the
- maximum of the "-k" switch and the "-ck" switch. Actually this is a
- half-truth. Antlr rounds the maximum to the next higher power of 2 and
- calls this "LL_K". Thus if one were to invoke Antlr with -k=1 -ck=3 the
- value of LL_K (and the number of buffers allocated for lookahead tokens)
- will actually be 4.
- Page 4
-
- (Item 12)
- ##. Suppose one wants to parse files that "include" other files. The
- code in ANTLR (antlr.g) for handling #tokdefs statements demonstrates
- how this may be done.
-
- grammar: ...
-
- | "#tokdefs" QuotedTerm
-
- <<{
-
- zzantlr_state st; /* defined in antlr.h */
- struct zzdlg_state dst; /* defined in dlgdef.h */
- FILE *f;
-
- UserTokenDefsFile = mystrdup(LATEXT(1));
- zzsave_antlr_state(&st);
- zzsave_dlg_state(&dst);
- f = fopen(StripQuotes(LATEXT(1)),"r");
- if ( f==NULL ) {
- warn(eMsg1("cannot open token defs file '%s'",
- LATEXT(1)+1));}
- else {
- ANTLRm( enum_file(), f, PARSE_ENUM_FILE);
- UserDefdTokens = 1;
- }
- zzrestore_antlr_state(&st);
- zzrestore_dlg_state(&dst);
- }>>
-
- The code uses zzsave_antlr_state() and zzsave_dlg_state() to save the state
- of the current parse. The ANTLRm macro specifies a starting rule for ANTLR
- of "enum_file" and starts DLG in the PARSE_ENUM_FILE state rather than the
- default state (which is the current state - whatever it might be). Because
- enum_file() is called without any arguments it appears that enum_file() does
- not use ASTs nor pass back any attributes. Contributed by Terence J. Parr.
- (Item 13)
- ##. If an action becomes too large then it will overflow an ANTLR buffer
- ("... error: action buffer overflow: size 4000").
-
- In cases where the code does NOT contain any references such as #[...],
- #(...), $xxx, #yyy etc. (which require substitution by Antlr) you can put
- the action in an include file and then place a #include in the action.
- This is almost always effective with #lexaction and the main action.
-
- Suggested by David Seidel (dseidel@delphi.com).
-
- In other cases you must re-make Antlr with a larger value for ZZLEXBUFSIZE.
- The change can be made to the default value for ZZLEXBUFSIZE near line 73
- of pccts/h/antlr.h or by adding a statement like:
-
- #define ZZLEXBUFSIZE 8192
-
- to pccts/antlr/antlr.g in the #header.
-
- Splitting an action of a rule into two smaller actions will not work if
- the second action needs to refer to zzlextext.
- Page 5
-
- (Item 14)
- ##. When one is using multiple input files (for example "a.g" and "b.g"
- to generate "a.c" and "b.c") the only way to place file scope information
- in b.c is to place it in #header of the first grammar file. ANTLR won't
- allow file scope information to be copied from b.g into b.c using
- "<<...>>" notation. If one did place a file scope action in the b.g, ANTLR
- would try to interpret it as the fail action of the last rule appearing in
- a.g. (the first grammar file). The workaround is to #include b.c in
- another file which has your file scope declarations. You'll probably
- need to #include "stdpccts.h" before your file scope definitions.
- (Item 15)
- ##. Multiple parsers can coexist in the same application through use of
- the #parser directive (in C output mode). The #parser statement is not used
- with the ANTLR C++ output option because one can simply instantiate a new
- parser object. The statement "#parser xyz" adds the prefix "xyz" to all
- rule names and many pccts defined names. This is done as something of an
- afterthought by creating the #include file remap.h with definitions like
- the following:
-
- #define statement xyz_statement /* a rule redefined */
- #define zztokenLA xyz_zztokenLA /* pccts global redefined */
- #define AST xyz_AST /* pccts typedef redefined */
- #define setwd1 xyz_setwd1 /* token test sets */
- #define zzerr_1 xyz_zzerr_1 /* error sets */
-
-
- Page 6
- ===============================================================================
- Section on switches and options
- -------------------------------------------------------------------------------
- (Item 16)
- ##. Invoking antlr or DLG with nothing else on the command line will
- cause them to print out a switch summary.
- (Item 17)
- ##. Don't forget about the ANTLR -gd option which provides a trace of
- rules which are triggered and exited.
-
- The trace option can be useful in sometimes unexpected ways. For example,
- by suitably defining the macros zzTRACEIN and zzTRACEOUT before the
- #include of "antlr.h" one can accumulate information on how often each
- rule is invoked.
- (Item 18)
- ##. When you want to inspect the code generated by ANTLR you may want to
- use the ANTLR -gs switch. This causes ANTLR to test for a token being
- an element of a lookahead set by using explicit tests with meaningful
- token names rather than by using the faster bit-oriented operations which are
- difficult to read.
- (Item 19)
- ##. When using the ANTLR -gk option you probably want to use the DLG -i
- option. As far as I can tell neither option works by itself.
- Unfortunately they have different abbreviations so that one can't
- use the same symbol for both in a makefile.
- (Item 20)
- ##. When you are debugging code in the rule section and there is no
- change to the lexical scanner, you can avoid regeneration of scanner.c
- by using the ANTLR -gx option. However some items from stdpccts.h
- can affect the scanner, such as -k -ck and the addition of semantic
- predicates - so this optimization should be used with a little care.
- (Item 21)
- ##. One cannot use an interactive scanner (ANTLR -gk option) with the
- ANTLR infinite lookahead and backtracking options (syntactic predicates).
- (Item 22)
- ##. If you want backtracking, but not the prefetching of characters and
- tokens that one gets with lookahead, then you might want to try using
- your own input routine and then using ANTLRs (input supplied by string)
- or ANTLRf (input supplied by function) rather than plain ANTLR which
- is used in most of the examples.
-
- See Example 4 below for an example of an ANTLRf input function.
- (Item 23)
- ##. The format used in #line directive is controlled by the macro
-
- #define LineInfoFormatStr "# %d \"%s\"\n"
-
- which is defined in generic.h. A change requires recompilation of ANTLR.
-
- The Antlr switch -gl may sometimes cause Antlr to place #line directives
- in a column other than column 1 when processing semantic predicates. The
- temporary workaround is to change the format string to:
-
- #define LineInfoFormatStr "\n# %d \"%s\"\n"
-
- This bug is present in version 1.23.
- (Item 24)
- ##. To make the lexical scanner case insensitive use the DLG -ci
- switch. The analyzer does not change the text, it just ignores case
- when matching it against the regular expressions.
-
- The problem in version 1.10 with the -ci switch is fixed in versions >= 1.20.
- Page 7
-
- (Item 25)
- ##. In order to use a different name for the mode.h file it is necessary
- to supply the new name using both the ANTLR -fm switch and the DLG -m switch.
- ANTLR does not generate mode.h, but it does generate #include statements
- which reference it.
-
- ===============================================================================
- C++ Mode
- -------------------------------------------------------------------------------
- (Item 26)
- ##. Prior to version 1.23, when using backtracking (syntactic predicates),
- ANTLRToken had to be derived explicitly from ANTLRCommonBacktrackingToken
- rather than ANTLRCommonToken. With version 1.23 Antlr generates a typedef for
- the base class so that the correct one is automatically chosen.
-
- Page 8
- ===============================================================================
- Section on #token, #tokclass, #tokdef #errclass (but not #lexclass)
- -------------------------------------------------------------------------------
- (Item 27)
- ##. If you can't figure out what the DLG lexer is doing try inserting
- the following code near line 434 of pccts/h/dlgauto.h:
-
- #include "string.h"
-
- old--> (*actions[accepts[state]])(); /* invokes action routine */
-
- add--> {char zzcharstring[]="?"; /* put zzchar in string */
- zzcharstring[0]=zzchar;
-
- printf ("\nNLA=%s zzlextext=(%s) zzchar=(%s) %s\n",
- zztokens[NLA], /* token name */
- (strcmp (zzlextext,"\n")==0 ? "newline" : zzlextext),
- /* render \n as "newline" */
- (strcmp (zzcharstring,"\n")==0 ? "newline" : zzcharstring),
- /* render \n as "newline" */
- (zzadd_erase==1 ? "zzskip()" : /* called zzskip() ? */
- zzadd_erase==2 ? "zzmore()" : /* called zzmore() ? */
- "")); /* none of the above */
- };
-
- NLA: the token number of the token just identified
- this is a macro
- zztokens: array indexed by token number giving the token name
- zzlextext: the text of the token just identified
- zzchar: the lookahead character
- (Item 28)
- ##. To gobble up everything to a newline use: "~[\n]*".
- (Item 29)
- ##. To match any single character use: "~[]".
- (Item 30)
- ##. The char * array "zztokens" in err.c contains the text for the name of
- each token (indexed by the token number). This can be extremely useful
- for debugging and error messages.
- (Item 31)
- ##. If a #token symbol is spelled incorrectly in a rule it will not be
- reported by ANTLR unless the ANTLR -w2 option is set. ANTLR will assign
- it a new #token number which, of course, will never be matched. Look at
- token.h for misspelled terminals or inspect "zztokens[]" in err.c.
- (Item 32)
- ##. If you happen to define the same #token name twice (perhaps
- because of inadvertent duplication of a name) you will receive no
- error messages from ANTLR or DLG. ANTLR will simply use the later
- definition and forget the earlier one. Using the ANTLR -w2 option
- does not change this behavior.
- (Item 33)
- ##. One cannot continue a regular expression in a #token statement across
- lines. If one tries to use "\" to continue the line the lexical analyzer
- will think you are trying to match a newline character.
- (Item 34)
- ##. The escaped literals in #token regular expressions are not identical
- to the ANSI escape sequences. For instance "\v" will yield a match
- for "v", not a vertical tab.
-
- \t \n \r \b - the only escaped letters
- Page 9
-
- (Item 35)
- ##. In #token regular expressions spaces and tabs which are
- not escaped are ignored - thus making it easy to add white space to
- a regular expression.
-
- #token symbol "[a-z A-Z] [a-z A-Z 0-9]*"
- (Item 36)
- ##. You can achieve a limited form of one character lookahead in the
- #token statement action by using zzchar which contains the character
- following the regular expression just recognized. See Example 11.
- (Item 37)
- ##. The regular expressions appearing in #errclass declarations must
- be unique.
- (Item 38)
- ##. You cannot supply an action (even a null action) for a #token
- statement without a regular expression. You'll receive the message:
-
- warning: action cannot be attached to a token name
- (...token name...); ignored
-
- This is a minor problem when the #token is created for use with
- attributes or AST nodes and has no regular expression:
-
- #token CAST_EXPR
- #token SUBSCRIPT_EXPR
- #token ARGUMENT_LIST
-
- <<
- ... Code related to parsing
- >>
-
- ANTLR assumes the code block is the action associated with the #token
- immediately preceding it. It is not obvious what the problem is because
- the line number referenced is the END of the code block (">>") rather
- than the beginning. My solution is to follow such #token statements
- with a #token which does have a regular expression (or a rule).
- (Item 39)
- ##. Since the lexical analyzer wants to find the longest possible string
- that matches a regular expression, it is probably best not to use expressions
- like "~[]*" which will gobble up everything to the end-of-file.
- (Item 40)
- ##. Calls to zzskip() and zzmore() should appear only in #token actions
- (or in code called from #token actions). They don't belong in the actions
- of rules. Routine zzskip() causes DLG to throw away the text just
- collected and to start looking for another regular expression. Routine
- zzmore() tells DLG that the token is not complete and to look for more
- text. They are purely lexical actions.
- (Item 41)
- ##. The lexical routines zzmode(), zzskip(), and zzmore() do NOT work like
- coroutines. Basically, all they do is set status bits or fields in a
- structure owned by the lexical analyzer and then return immediately. Thus it
- is OK to call these routines anywhere from within a lexical action. You
- can even call them from within a subroutine called from a lexical action
- routine.
-
- See Example 5 below for routines which maintain a stack of modes.
- Page 10
-
- (Item 42)
- ##. When a string is matched by two #token regular expressions of equal
- length, the lexical analyzer will choose the one which appears first in
- the source code. Thus more specific regular expressions should appear
- before more general ones:
-
- #token HELP "help" /* should appear before "symbol" */
- #token symbol "[a-zA-Z]*" /* should appear after keywords */
-
- Some of these may be caught by using the DLG switch -Wambiguity.
- Consider the following grammar:
-
- #header <<
- #include "charbuf.h"
- >>
- <<
- int main() {
- ANTLR (statement(),stdin);
- return 0;
- }
- >>
-
- #token WhiteSpace "[\ \t]" <<zzskip();>>
- #token ID "[a-z A-Z]*"
- #token HELP "HELP"
-
- statement
- : HELP "@" <<printf("token HELP\n");>> /* a1 */
- | "inline" "@" <<printf("token inline\n");>> /* a2 */
- | ID "@" <<printf("token ID\n");>> /* a3 */
- ;
-
- Is an in-line regular expression treated any differently than a regular
- expression appearing in a #token statement? No! ANTLR/DLG does *NOT*
- check for a match to "inline" (line a2) before attempting a match to the
- regular expressions defined by #token statements. The first two
- alternatives ("a1" and "a2") will NEVER be matched. All of this will be
- clear from examination of the file "parser.dlg".
-
- Another way of looking at this is to recognize that the conversion of
- character strings to tokens takes place in DLG, not Antlr, and that all
- that is happening with an in-line regular expression is that Antlr is
- allowing you to define a token's regular expression in a more convenient
- fashion - not changing the fundamental behavior.
-
- If one builds the example above using the DLG switch -Wambiguity one gets
- the message:
-
- dlg warning: ambigious regular expression 3 4
- dlg warning: ambigious regular expression 3 5
-
- Page 11
- The numbers which appear in the DLG message refer to the assigned token
- numbers. Examine the array zztokens[] in err.c to find the regular
- expression which corresponds to the token number reported by DLG.
-
- ANTLRChar *zztokens[6]={
- /* 00 */ "Invalid",
- /* 01 */ "@",
- /* 02 */ "WhiteSpace",
- /* 03 */ "ID",
- /* 04 */ "HELP",
- /* 05 */ "inline"
- };
-
- One can also look at the file "scan.c" in which action 4 would
- appear in the function "static void act4() {...}".
-
- The best advice is to follow the example of the Master, TJP, and place
- things like #token ID at the end of the grammar file.
- (Item 43)
- ##. The DLG lexical analyzer is not able to backtrack. Consider the
- following example:
-
- #token "[\ \t]*" <<zzskip();>>
- #token ELSE "else"
- #token ELSEIF "else [\ \t]* if"
- #token STOP "stop"
-
- with input:
-
- else stop
-
- When DLG gets to the end of "else" it realizes that the spaces will allow
- it to match a longer string than "else" by itself. So DLG starts to accept
- the spaces. When DLG gets to the initial "s" in "stop" it realizes it has
- gone too far - but it can't backtrack. It passes back an error status to
- ANTLR which (normally) prints out something like:
-
- invalid token near line 1 (text was 'else ') ...
-
- There is an "extra" space between the "else" and the closing single quote
- mark.
-
- This problem is not detected by the DLG option -Wambiguity.
- Page 12
-
- (Item 44)
- ##. If only one character of lookahead is necessary to distinguish the two
- tokens one can use zzchar. This is an excerpt from Example 11:
-
- #token Range ".."
- #token Int "[0-9]*"
- #token Float "[0-9]*.[0-9]*"
- <<if (*zzendexpr == '.' && /* might use more complex test */
- zzchar == '.') {
- NLA=Int;
- zzmode(LC_Range);
- };
- >>
-
- In this excerpt, a Range can be distinguished from a Float by seeing
- if the first "." is followed by a second ".".
-
- If more than one character of lookahead is necessary and it appears
- difficult to solve using #lexclass, semantic predicates, or other
- mechanisms you might want to consider using the University of California
- Berkeley flex, which is a super-set of lex. An example of how to use
- flex with Antlr is available on everest in /pub/pccts/contrib.
- (Item 45)
- ##. In converting a long list of tokens appearing in a rule to use #tokclass
- I simply replaced the rule, in situ, with the #tokclass directive and did a
- global replace of the rule name with a new name in which the first letter
- was capitalized. It took me a while to realize that the ANTLR message:
-
- xxx.g, line 123: warning: redefinition of tokclass or conflict
- w/token 'Literal'; ignored
-
- meant that I had used the #tokclass "Literal" before it was defined.
- Only rules, not tokens, can be used in forward references. The problem
- was fixed by moving the #tokclass statement up to the #token section of
- the file.
- (Item 46)
- ##. The char * variables zzbegexpr and zzendexpr point to the start and
- end of the string last matched by a regular expression in a #token statement.
-
- However, the char array pointed to by zzlextext may be larger than the
- string pointed to by zzbegexpr and zzendexpr because it includes substrings
- accumulated through the use of zzmore().
- (Item 47)
- ##. The preprocessor symbol ZZCOL in the lexical scanner controls the
- update of column information. This doesn't cause the zzsyn() routine to
- report the position of tokens causing the error. You'll still have to
- write that yourself. The problem, I think, is that, due to look-ahead,
- the value of zzendcol will not be synchronized with the token causing the
- error, so that the problem becomes non-trivial.
- (Item 48)
- ##. If you want to use ZZCOL to keep track of the column position
- remember to adjust zzendcol in the lexical action when a character is not
- one print position wide (e.g. tabs or non-printing characters).
- (Item 49)
- ##. The column information (zzbegcol and zzendcol) is not immediately
- updated if a token's action routine calls zzmore(). In cases where
- zzmore() is central to the lexical analysis (e.g. Example 8 which combines
- whitespace with the token that follows) it may be better to write one's
- own column position routine rather than using the pccts supplied code.
- Page 13
-
- (Item 50)
- ##. Variables zzbegcol and zzendcol are the column positions of the
- token just analyzed by DLG. When LL_K=1 this is generally the same as the
- token just analyzed by Antlr. When LL_K > 1 the information in zzbegcol and
- zzendcol will be several tokens ahead of where Antlr is and thus will
- give misleading information.
- (Item 51)
- ##. In version 1.00 it was common to change the token code based on
- semantic routines in the #token actions. With the addition of semantic
- predicates in 1.06 this technique is now frowned upon.
-
- Old style:
-
- #token TypedefName
- #token ID "[a-z A-Z]*"
- <<{if (isTypedefName(LATEXT(1))) NLA=TypedefName;};>>
-
- New Style:
-
- #token ID "[a-z A-Z]*"
-
- typedefName : <<LA(1)==ID ? isTypedefName(LATEXT(1)) : 1>> ID;
-
- The "old" technique is appropriate for making LEXICAL decisions based on
- the input: for instance treating whitespace differently in different
- contexts. The reason why the "new" style is especially important is that
- with infinite lookahead, of which guess mode is one case, it is not
- possible to make semantic decisions in the lexer because the parsing
- doesn't even begin until the lexing is complete.
-
- See the section on semantic predicates for a longer explanation.
- Page 14
-
- (Item 52)
- ##. DLG has no operator like grep's "^" which anchors a pattern to the
- beginning of a line. One can use tests based on zzbegcol only if column
- information is selected (#define ZZCOL) AND one is NOT using infinite
- lookahead mode (syntactic predicates). A technique which does not depend
- on zzbegcol is to look for the newline character and then enter a special
- #lexclass.
-
- Consider the problem of recognizing lines which have a "!" as the first
- character of a line. A possible solution suggested by Doug Cuthbertson
- is:
-
- #token "\n" <<zzline++; zzmode(BEGIN_LINE);>>
-
- *** or ***
-
- #token "\n" <<zzline++;
- if (zzchar=='!') zzmode(BEGIN_LINE);>>
-
- #lexclass BEGIN_LINE
- #token BANG "!" <<zzmode(START);>>
- #token "~[]" <<zzmode(START); zzmore();>>
-
- When a newline is encountered the #lexclass BEGIN_LINE is entered. If
- the next character is a "!" it returns the token "BANG" and returns
- to #lexclass START. If the next character is anything else it calls
- zzmore to accumulate additional characters for the token and, as before,
- returns to #lexclass START. (The order of calls to zzmode() and zzmore()
- is not significant).
-
- There are two limitations to this.
-
- a. If there are other single character tokens which can appear in the first
- column then using zzmore() won't be sufficient to work around the problem
- because the entire (one character) token has already been consumed. Thus
- all single character tokens which can appear in column 1 must appear in
- both #lexclass START and #lexclass BEGIN_LINE.
-
- b. The first character of the first line is not preceded by a newline,
- so DLG will be starting in the wrong state. Thus you might want to rename
- "BEGIN_LINE" to "START" and "START" to "NORMAL".
-
- Another solution is to use ANTLRf (input from a function) to insert
- your own function to do the kind of lexical processing which is difficult
- to express in DLG.
-
- In 1.20 the macro ANTLRm was added. It is similar to ANTLR, but has an
- extra argument which allows one to specify the lexical class which is
- passed to zzmode() to set the initial #lexclass state of DLG.
- (Item 53)
- ##. In version 1.10 there were problems using 8 bit characters with DLG.
- Versions >= 1.20 of ANTLR/DLG work with 8 bit character sets when they are
- compiled in a mode in which char variables are by default unsigned (the
- g++ option "-funsigned-char"). This should be combined with a call to
- setlocale (LC_ALL,"") to replace the default locale of "C" with the user's
- native locale. This is system dependent - it works with Unix systems but
- not DOS. Contributed by Ulfar Erlingsson (ulfarerl@rhi.hi.is).
-
- See Example 4 below.
- (Item 54)
- ##. Example 8 demonstrates how to pass whitespace through DLG for
- such applications as pretty-printers.
- Page 15
-
- (Item 55)
- ##. In version 1.30 it will be possible to test whether a token is
- a member of a #tokclass named "A" with a statement like the following:
-
- if (set_el(LA(1),A_set)) {...}
-
- set_el(unsigned,set) is defined in pccts/support/set/set.c
-
- Until that time a workaround is to define all members of a #tokclass
- together so as to take advantage of the knowledge that Antlr assigns
- #token numbers sequentially. With that information one can write:
-
- if (LA(1) >= first_token_in_tokclass_A &&
- LA(1) <= last_token_in_tokclass_A) {...}
-
- (kenw@ihs.com).
-
- Page 16
- ===============================================================================
- Section on #lexclass
- -------------------------------------------------------------------------------
- (Item 56)
- ##. Example 10 gives a simple illustration of #lexclass.
- (Item 57)
- ##. Special care should be taken when using "in-line" regular expressions
- in rules if there are multiple lexical classes (#lexclass). ANTLR will
- place such regular expressions in the last lexical class defined. If
- the last lexical class was not START you may be surprised.
-
- #lexclass START
- ....
- #lexclass COMMENT
- ....
-
- inline_example: symbol "=" expression
-
- This will place "=" in the #lexclass COMMENT (where
- it will never be matched) rather than the START #lexclass
- where the user meant it to be.
-
- Since it is okay to specify parts of the #lexclass in several pieces
- it might be a good idea when using #lexclass to place "#lexclass START"
- just before the first rule - then any in-line definitions of tokens
- will be placed in the START #lexclass automatically.
-
- #lexclass START
- ...
- #lexclass A
- ...
- #lexclass B
- ...
- #lexclass START
- (Item 58)
- ##. A good example of the use of #lexclass is the set of definitions for C
- and C++ style comments, character literals, and string literals which
- can be found in pccts/lang/C/decl.g - or see Example 1 below.
- (Item 59)
- ##. The initial #lexclass of DLG is set by a data statement to START
- (which is 0). Unlike ANTLRm, the traditional ANTLR macros (ANTLRf, ANTLRs,
- and ANTLR) do NOT reset the #lexclass. If you call ANTLR multiple times
- during a program (for instance to parse each statement of a line-oriented
- language independently) DLG will resume in the #lexclass that it was in
- when ANTLR returned. If you want to restart DLG in the START state you
- should precede the call to ANTLR with
-
- zzmode(START);
- or use:
- ANTLRm (myStartRule(),myStartMode);
- Page 17
-
- (Item 60)
- ##. Consider the problem of a grammar in which a statement is composed
- of clauses, each of which has its own #lexclass and in which a given
- word is "reserved" in some clauses and not others:
-
- #1;1-JAN-94 01:23:34;enable;a b c d;this is a comment;
- #2;1-JAN-94 08:01:56;operator;smith;move to another station;
- #3;1-JAN-94 09:10:11;move;old pos=5.0 new pos=6.0;operator request;
- #4;1-JAN-94 10:11:12;set-alarm;beeper;2-JAN-94 00:00:01;
-
- One would like to reuse a #lexclass if possible. There is no problem with
- maintaining a stack of modes (#lexclass numbers) and pushing a new mode
- onto the stack each time a new #lexclass subroutine is called. How to do
- this is demonstrated in Example 5. The problem appears when it is
- necessary to leave a #lexclass and return more than one level. To be more
- specific, a #token action can only be executed when one or more characters
- is consumed - so to return through three levels of #lexclass calls would
- appear to require the consumption of at least three characters. In the
- case of balanced constructs like "...", and '...', or (...) this is not a
- problem since the terminating character can be used to trigger the #token
- action. However, if the scan is terminated by a "separator", such as the
- semi-colon above (";"), one cannot use the same technique. Once the
- semi-colon is consumed it is unavailable for the other #lexclass routines
- on the stack to see.
-
- My solution is to allow the user to specify (during the call to pushMode)
- a "lookahead" routine to be called when the corresponding element of the
- mode stack is popped. At that point the "lookahead" routine can examine
- zzchar to determine whether it also wants to pop the stack, and so on up
- the mode stack. The consumption of a single character can result in
- popping multiple modes from the mode stack based on a single character of
- lookahead. See the second part of Example 5 below.
-
- Continuing with the example of the log file (above): each statement type
- has its fields in a specific order. When the statement type is recognized
- a pointer is set to a list of the #lexclasses which is in the same order as
- the remaining fields of that kind of statement. An action attached to
- every #token which recognizes a semi-colon (";") advances a pointer in
- the list of #lexclasses and then changes the #lexclass by calling zzmode()
- to set the #lexclass for the next field of the statement.
-
- Page 18
- ===============================================================================
- Section on rules
- -------------------------------------------------------------------------------
- (Item 61)
- ##. If you can't figure out what Antlr is doing try adding the -gd
- switch (debug via rule trace) and the -gs switch (perform lookahead
- tests using symbolic names for tokens rather than bit-oriented set
- tests).
- (Item 62)
- ##. Antlr can't handle left-handed recursion. A rule such as:
-
- expr : expr Op expr
- | Number
- | String
- ;
-
- will have to be rewritten to something like this:
-
- expr : Number (Op expr)*
- | String (Op expr)*
- ;
- (Item 63)
- ##. Another sort of transformation required by Antlr is left-factoring:
-
- rule : STOP WHEN expr
- | STOP ON expr
- | STOP IN expr
-             ;
-
- These are easily distinguishable when k=2, but with a small amount of
- work they can be converted into a k=1 grammar:
-
- rule : STOP ( WHEN expr
- | ON expr
- | IN expr
- )
- ;
-
- or
- rule : STOP rule_suffix
- ;
- rule_suffix : WHEN expr
- | ON expr
- | IN expr
- ;
-
- An extreme case of a grammar requiring a rewrite is in Example 12.
- (Item 64)
- ##. If a rule is not used (is an orphan) it can lead to unanticipated
- reports of ambiguity. Use the ANTLR cross-reference option (-cr) to
- locate rules which are not referenced. Not verified in version 1.20.
- (Item 65)
- ##. ANTLR attempts to deduce "start" rules by looking for rules which
- are not referenced by any other rules. When it finds such a rule it
- assumes that an EOF token ("@") should be there and adds one if the
- user did not code one. This is the only case, according to TJP, when
- ANTLR adds something to the user's grammar.
- Page 19
-
- (Item 66)
- ##. To express the idea "any single token is acceptable at this point"
- use the "." token wild-card. This can be very useful for providing a
- context dependent error message, rather than the all purpose message
- "syntax error".
-
- if-stmt : IF "\(" expr "\)" stmt
-        | IF .  <<printf("If statement requires expression "
-                         "enclosed in parentheses\n");
-                  PARSE_FAIL;
-                >>
-        ;
-
- It is probably best not to use expressions such as:
-
- ignore: (.)* /* Not a good idea */
-
- which will gobble up everything to the end-of-file.
- (Item 67)
- ##. New to version 1.20 is the "~" operator for tokens. It allows
- one to specify tokens which must NOT match in order to match a rule.
-
- The "~" operator cannot be applied to rules. To express the idea
- "if this rule doesn't match try to match this other rule" use
- syntactic predicates.
- (Item 68)
- ##. Some constructs which are bound to cause warnings about
- ambiguities:
-
- rule : a { ( b | c )* };
-
- rule : a { b };
- b : ( c )*;
-
- rule : a c*;
- a : b { c };
-
- rule : a { b | c | };
- Page 20
-
- (Item 69)
- ##. Don't confuse init-actions with actions which precede a rule
- (leading-actions). If the first element following the start of a rule
- or sub-rule is an action it is always interpreted as an init-action.
-
- An init-action occurs in a scope which includes the entire rule or sub-rule.
- An action which is NOT an init-action is enclosed in "{" and "}" during
- generation of code for the rule and has essentially zero scope - the
- action itself.
-
- The difference between an init-action and an action which precedes a rule
- can be especially confusing when an action appears at the start of an
- alternative:
-
- These APPEAR to be almost identical, but they aren't:
-
- b : <<int i=0;>> b1 > [i] /* b1 <<...>> is an init-action */
- | <<int j=0;>> b2 > [j] /* b2 <<...>> is part of the rule */
- ; /* and will cause a compilation error */
-
- On line "b1" the <<...>> appears immediately after the beginning of the
- rule making it an init-action. On line "b2" the <<...>> does NOT appear at
- the start of a rule or sub-rule, thus it is interpreted as an action which
- happens to precede the rule.
-
- This can be especially dangerous if you are in the habit of rearranging
- the order of alternatives in a rule. For instance:
-
- Changing this:
-
- b : <<int i=0,j=0;>> <<i++;>> b1 > [i] /* c1 */
- | <<j++;>> b1 > [i] /* c2 */
- ;
-
- to:
-
- b : /* empty production */ /* d1 */
- | <<int i=0,j=0;>> <<i++;>> b1 > [i] /* d2 */
- | <<j++;>> b1 > [i]
- ;
-
- or to this:
-
- b
- : <<j++;>> b1 > [i] /* e1 */
- | <<int i=0,j=0;>> <<i++;>> b1 > [i] /* e2 */
-
- changes an init-action into a non-init action, and vice-versa.
- Page 21
-
- (Item 70)
- ##. A particularly nasty form of the init-action problem is when
- an empty sub-rule has an associated action:
-
- rule!: ID (
- /* empty */
- <<#0=#[ID,$1.1];>>
- | array_bounds
-                 <<#0=#(#[T_array_declaration,$1.1],#1);>>
- )
- ;
-
- Since there is no reserved word in pccts for epsilon, the action
- for the empty arm of the sub-rule becomes the init-action. For
- this reason it's wise to follow one of these conventions:
- (1) represent epsilon with an empty sub-rule "()", or (2) make the
- empty alternative the last one in the list of alternatives:
-
- rule!: ID (
- () <<#0=#[ID,$1.1];>>
- | array_bounds
-                 <<#0=#(#[T_array_declaration,$1.1],#1);>>
- )
- ;
-
- The cost of using "()" to represent epsilon is the execution of the macro
- zzBLOCK() at the start of the sub-rule and zzEXIT() at the end of the
- sub-rule. Macro zzBLOCK() creates a temporary stack pointer for the
- attribute stack and checks for overflow. Macro zzEXIT() pops any
- attributes that might have been placed on attribute stack. Since no
- attribute stack operations take place for epsilon this is wasted CPU
- cycles, however this is probably not a significant cost for many users.
- (Item 71)
- ##. Another form of problem caused by init-action occurs when one
- comments out a rule in the grammar in order to test an idea:
-
- rule /* a1 */
-     : <<init-action;>>                  /* a2 */
- //// rule_a /* a3 */
- | rule_b /* a4 */
- | rule_c /* a5 */
-
- In this case one only wanted to comment out the "rule_a" reference
- in line "a3". The reference is indeed gone, but the change has
- introduced an epsilon production - which probably creates a large
- number of ambiguities. Without the init-action the ":" would probably
- have been commented out as well, and ANTLR would report a syntax
- error - thus preventing one from shooting oneself in the foot.
- (Item 72)
- ##. In the case of sub-rules such as (...)+, (...)*, and {...} the
- init-action is executed just once before the sub-rule is entered.
- Consider the following example from section 3.6.1 (page 29) of the 1.00
- manual:
-
- a : <<List *p=NULL;>> // initialize list
- Type
- ( <<int i=0;>> // initialize index
- Var <<append(p,i++,$1);>>
- )*
- <<OperateOn(p);>>
- ;
- Page 22
-
- (Item 73)
- ##. Associativity and precedence of operations is determined by
- nesting of rules. In the example below "=" associates to the right
- and has the lowest precedence. Operators "+" and "*" associate to
- the left with "*" having the highest precedence.
-
- expr0 : expr1 {"=" expr0};
- expr1 : expr2 ("\+" expr2)*;
- expr2 : expr3 ("\*" expr3)*;
- expr3 : ID;
-
- See Example 2.
- (Item 74)
- ##. Fail actions for a rule can be placed after the final ";" of
- a rule. These will be:
-
- "executed after a syntax error is detected but before
- a message is printed and the attributes have been destroyed.
- However, attributes are not valid here because one does not
- know at what point the error occurred and which attributes
- even exist. Fail actions are often useful for cleaning up
- data structures or freeing memory."
-
- (Page 29 of 1.00 manual)
-
- Example of a fail action:
-
- a : <<List *p=NULL;>>
- ( Var <<append(p,$1);>> )+
- <<operateOn(p);rmlist(p);>>
- ; <<rmlist(p);>>
- ************** <--- Fail Action
- (Item 75)
- ##. When you have rules with large amounts of lookahead (that may
- cross several lines) you can use the ANTLR -gk option to make an
- ANTLR-generated parser delay lookahead fetches until absolutely
- necessary. To get better line number information (e.g. for error
- messages or #line directives) place an action which will save
- "zzline" in a variable at the start of the production where you
- want better line number information:
-
- a : <<int saveCurrentLine;>>
- <<saveCurrentLine = zzline;>> A B C
- << /* use saveCurrentLine not zzline here */ >>
- | <<saveCurrentLine = zzline;>> A B D
- << /* use saveCurrentLine not zzline here */ >>
- ;
-
- After the production has been matched you can use saveCurrentLine
- rather than the bogus "zzline".
-
- Contributed by Terence "The ANTLR Guy" Parr (parrt@acm.org)
-
- In version 1.20 a new macro, ZZINF_LINE(), was added to extract line
- information in a manner similar to LATEXT when using infinite lookahead
- mode. See the page 6 of the 1.20 release notes for more information.
- There is nothing like ZZINF_COL() for column information, but it should
- be easy to create using ZZINF_LINE() as a model. Maybe.
- (Item 76)
- ##. An easy way to get a list of the names of all the rules is
- to grep tokens.h for the string "void" or edit the output from ANTLR
- run with the -cr option (cross-reference).
- Page 23
-
- (Item 77)
- ##. It took me a while to understand in an intuitive way the difference
- between full LL(k) lookahead given by the ANTLR -k switch and the
- linear approximation given by the ANTLR -ck switch. This was in spite
- of the example given in section 5 (pages 18 to 21) of the 1.10 release notes.
-
- Most of the time I run ANTLR with -k 1 and -ck 2. Because I didn't
- understand the linear approximation I didn't understand the warnings about
- ambiguity. I couldn't understand why ANTLR would complain about something
- which I thought was obviously parseable with the lookahead available.
- Was it a bug or was it me? I would try to make the messages go away
- totally, which was sometimes very hard. If I had understood the linear
- approximation I might have been able to fix them easily or at least have
- realized that there was no problem with the grammar, just with the
- limitations of the linear approximation.
-
- I will restrict the discussion to the case of "-k 1" and "-ck 2".
-
- Consider the following example:
-
- rule1 : rule2a | rule2b | rule2c ;
- rule2a : A X | B Y | C Z ;
- rule2b : B X | B Z ;
- rule2c : C X ;
-
- It should be clear that with the sentence being only two tokens this
- should be parseable with LL(2).
-
- Instead, because k=1 and ck=2 ANTLR will produce the following messages:
-
- /pccts120/bin/antlr -k 1 -gs -ck 2 -gh example.g
- Antlr parser generator Version 1.20 1989-1994
- example.g, line 23: warning: alts 1 and 2 of the rule itself
- ambiguous upon { B }, { X Z }
- example.g, line 23: warning: alts 1 and 3 of the rule itself
- ambiguous upon { C }, { X }
-
- The code generated resembles the following:
-
- if (LA(1)==A || LA(1)==B || LA(1)==C) &&
- (LA(2)==X || LA(2)==Y || LA(2)==Z) then rule2a()
-
- else if (LA(1)==B) &&
-             (LA(2)==X || LA(2)==Z)             then rule2b()
-
-    else if (LA(1)==C) &&
-            (LA(2)==X)                          then rule2c()
- ...
-
- This might be called "product-of-sums". There is an "or" part for
- LA(1), an "or" part for LA(2), and they are combined using "and".
- To match, the first lookahead token must be in the first set and the second
- lookahead token must be in the second set. Unfortunately, what
- one really wants is:
-
- Page 24
- if (LA(1)==A && LA(2)==X) ||
- (LA(1)==B && LA(2)==Y) ||
- (LA(1)==C && LA(2)==Z) then rule2a()
-
- else if (LA(1)==B && LA(2)==X) ||
- (LA(1)==B && LA(2)==Z) then rule2b()
-
- else if (LA(1)==C && LA(2)==X) then rule2c()
-
- This is "sum-of-products", but the real problem is that each
- product involves one element from LA(1) and one from LA(2) and as the
- number of possible tokens increases the number of terms grows as N**2.
- With the linear approximation the number of terms grows (surprise)
- linearly in the number of tokens.
-
- ANTLR won't do this with k=1, it will only do "product-of-sums". However,
- all is not lost - you simply add a few well chosen semantic predicates
- which you have computed using your LL(k>1), all purpose, carbon based,
- analog computer.
-
- The linear approximation selects for each branch of the "if" a set which
- MAY include more than what is wanted. It never selects a subset of the
- correct lookahead sets! We simply insert a hand-coded version of the
- LL(2) computation. It's ugly, especially in this case, but it fixes the
- problem. In large grammars it may not be possible to run ANTLR with k=2,
- so this fixes a few rules which cause problems. The generated parser may
- run faster because it will have to evaluate fewer terms at execution time.
-
- <<
- int bypass_rule2a() {
- if ( LA(1)==B && LA(2)==Y ) return 0;
- if ( LA(1)==B ) return 1;
- if ( LA(1)==C && LA(2)==X ) return 1;
- return 0;
- }
- >>
-
- rule1 :
-         <<!bypass_rule2a()>>? rule2a | rule2b | rule2c ;
- rule2a : A X | B Y | C Z ;
- rule2b : B X | B Z ;
- rule2c : C X ;
-
- The real cases I've coded have shorter code sequences in the semantic
- predicate. I coded this as a function to make it easier to read and
- because there is a bug in 1.1x and 1.2x which prevents semantic predicates
- from crossing lines. Another reason to use a function (or macro) is to
- make it easier to read the generated code to determine when your semantic
- predicate is being hoisted too high (it's easy to find references to a
- function name with the editor - but difficult to locate a particular
- sequence of "LA(1)" and "LA(2)" tests). Predicate hoisting is a separate
- issue which is described elsewhere in this note.
-
- Page 25
- In some cases of reported ambiguity it is not necessary to add semantic
- predicates because no VALID token sequence could get to the wrong rule.
- If the token sequence were invalid it would be detected by the grammar
- eventually, although perhaps not where one might wish. In other cases
- the only necessary action is a reordering of the ambiguous rules so
- that a more specific rule is tested first. The error messages still
- appear, but one can ignore them or place a trivial semantic predicate
- (i.e. <<1>>? ) in front of the later rules. This makes ANTLR happy
- because it thinks you've added a semantic predicate which fixes things.
-
- Some constructs just invite problems. For instance in C++ with a suitable
- definition of the class "C" one can write:
-
- C a,b,c /* a1 */
- a.func1(b); /* a2 */
- a.func2()=c; /* a3 */
- a = b; /* a4 */
- a.operator =(b); /* a5 */
-
- Statement a5 happens to place an "=" (or any of the usual C++ operators)
- in a token position where it can cause a lot of ambiguity in the lookahead
- set. I eventually solved this particular problem by creating a special
- #lexclass for things which follow "operator". I use an entirely different
- token number for such operators - thereby avoiding the whole problem.
-
- //
- // C++ operator sequences
- //
- // operator <type_name>
- // operator <special characters>
- //
- // There must be at least one non-alphanumeric character between
- // "operator" and operator name - otherwise they would be run
- // together - ("operatorint" instead of "operator int")
- //
-
- #lexclass LEX_OPERATOR
- #token FILLER_C1 "[\ \t]*"
- <<zzskip();
- if( isalnum(zzchar) ) zzmode(START);
- >>
- #token OPERATOR_STRING "[\+\-\*\/\%\^\&\|\~\!\=\<\>]*"
- <<zzmode(START);>>
- #token FILLER_C2 "\(\) | \[\] "
- <<NLA=OPERATOR_STRING;zzmode(START);>>
-
-
- Page 26
- ===============================================================================
- Section on Attributes
- -------------------------------------------------------------------------------
- (Item 78)
- ##. With version 1.30 one will no longer have to refer to attributes or
- ASTs of a rule using numbers.
-
- prior to version 1.30:
- rule : X Y Z <<printf("%s %s %s\n",$1,$2,$3);>>
-
- with version 1.30:
- rule : x:X y:Y z:Z <<printf("%s %s %s\n",$x,$y,$z);>>
-
- Many of the examples in this section need to be revised to reflect the
- use of symbolic tags.
- (Item 79)
- ##. Attributes are built automatically only for terminals. For
- rules (non-terminals) one must assign an attribute to $0, use the
- $[token,...] convention for creating attributes, or use zzcr_attr().
- (Item 80)
- ##. The way to access the text (or whatever) part of an attribute
- depends on the way the attribute is stored.
-
- If one uses the pccts supplied routine "pccts/h/charbuf.h" then
-
- id : "[a-z]+" <<printf("Token is %s\n",$1.text);>>
-
- If one uses the pccts supplied routine "pccts/h/charptr.c" and
- "pccts/h/charptr.h" then:
-
- id : "[a-z]+" <<printf("Token is %s\n",$1);>>
-
- If one uses the pccts supplied routine "pccts/h/int.h" (which
- stores numbers only) then:
-
- number : "[0-9]+" <<printf ("Token is %d\n",$1);>>
-
- Note the use of %d rather than %s in the printf() format.
- (Item 81)
- ##. The expression $$ refers to the attribute of the named rule.
- The expression $0 refers to the attribute of the enclosing rule
- (which might be a sub-rule).
-
- rule : a b (c d (e f g) h) i
-
- For (e f g) $0 becomes $3 of (c d ... h). For (c d ... h) $0 becomes
- $3 of (a b ... i). However, $$ is always equivalent to $rule.
- (Item 82)
- ##. If you define a zzcr_attr() or zzmk_attr() which allocates resources
- such as strings from the heap don't forget to define a zzd_attr() routine
- to release the resources when the attribute is deleted.
- (Item 83)
- ##. Attributes go out of scope when the rule or sub-rule that defines
- them is exited. Don't try to pass them to an outer rule or a sibling
- rule. The only exception is $0 which may be passed back to the containing
- rule as a return argument. However, if the attribute contains a pointer
- which is copied (e.g. pccts/h/charptr.c) then extra caution is required
- because of the actions of zzd_attr(). For C++ users this should be
- implemented in the class copy constructor. The version of pccts/h/charptr.*
- distributed with pccts does not use C++ features. See the next item for
- more information.
- Page 27
-
- (Item 84)
- ##. The pccts/h/charptr.c routines use a pointer to a string. The string
- itself will go out of scope when the rule or sub-rule is exited. Why?
- The string is copied to the heap when ANTLR calls the routine zzcr_attr()
- supplied by charptr.c - however ANTLR also calls the charptr.c supplied
- routine zzd_attr() (which frees the allocated string) as soon as the rule or
- sub-rule exits. The result is that in order to pass charptr.c strings to
- outer rules (for instance to $0) it is necessary to make an independent
- copy of the string using strdup or else zero the pointer to prevent its
- deallocation.
- (Item 85)
- ##. To initialize $0 of a sub-rule use a construct like the following:
-
- *** Note: This feature has been removed from version 1.30 of pccts. ***
-
- decl : typeID
- Var <<$2.type = $1;>>
- ( "," Var <<$2.type = $0;>>)*[$1]
- **** <--------------
-
- See section 4.1.6.1 (page 29) of the 1.00 manual
- (Item 86)
- ##. One can use the zzdef0() macro to define a standard method for
- initializing $0 of a rule or sub-rule. If the macro is defined it is
- invoked as zzdef0(&($0)).
-
- See section 4.1.6.1 (page 29) of the 1.00 manual
-
- I believe that for C++ users this would be handled by the class constructor.
- (Item 87)
- ##. If you construct temporary attributes in the middle of the
- recognition of a rule, remember to deallocate the structure should the
- rule fail. The code for failure goes after the ";" and before the next
- rule. For this reason it is sometimes desirable to defer some processing
- until the rule is recognized, rather than at the most convenient place.
-
- #include "pccts/h/charptr.h"
-
- statement!
- : <<char *label=0;>>
- {ID COLON <<label=MYstrdup($1);>> }
- statement_without_label
- <<#0=#(#[T_statement,label],#2);
- if (label!=0) free(label);
- // AST #1 is undefined
- // AST #2 is returned by
- // statement_without_label
- >>
- ;<<if (label !=0) free(label);>>
-
- In the above example attributes are handled by charptr.*. Readers of this
- note have been warned earlier about its dangers. The routine I have
- written to construct ASTs from attributes (invoked by #[int,char *]) knows
- about this behavior and automatically makes a copy of the character string
- when it constructs the AST. This makes the copy created by the explicit
- call to MYstrdup redundant once the AST has been constructed. If the call
- to "statement_without_label" fails then the temporary copy must be
- deallocated.
-
- Page 28
- ===============================================================================
- Section on ASTs
- -------------------------------------------------------------------------------
- (Item 88)
- ##. With version 1.30 one will no longer have to refer to attributes or
- ASTs of a rule using numbers:
-
- prior to version 1.30:
-    rule ! : x y z    <<#0=#(#1,#2,#3);>>
-
- with version 1.30:
- rule ! : xx:x yy:y zz:z <<#0=#(#xx,#yy,#zz);>>
-
- Many of the examples in this section need to be revised to reflect the
- use of symbolic tags.
- (Item 89)
- ##. If you define a zzcr_ast() or zzmk_ast() which allocates resources
- such as strings from the heap don't forget to define a zzd_ast() routine
- to release the resources when the AST is deleted. For C++ users this
- should be implemented as part of the class destructor.
- (Item 90)
- ##. Don't confuse #[...] with #(...).
-
- The first creates a single AST node (usually from a token identifier and
- an attribute) using the routine zzmk_ast(). The zzmk_ast() routine must be
- supplied by the user (or selected from one of the pccts supplied ones such
- as pccts/h/charbuf.h, pccts/h/charptr.*, and pccts/h/int.h).
-
- The second creates an AST list (usually more than a single node) from other
- ASTs by filling in the "down" field of the first node in the list to create
- a root node, and the "sibling" fields of each of the remaining ASTs in the
- list. A null pointer is put in the sibling field of the last AST in the
- list. This is performed by the pccts supplied routine zztmake().
-
- #token ID "[a-z]*"
- #token COLON ":"
- #token STMT_WITH_LABEL
-
- id! : ID <<#0=#[STMT_WITH_LABEL,$1];>> /* a1 */
-
- Creates an AST. The AST (a single node)
- contains STMT_WITH_LABEL in the token
- field - given a traditional version of
- zzmk_ast().
-
- rule! : id COLON expr /* a2 */
- <<#0=#(#1,#3);>>
-
- Creates an AST list with the ID at its
- root and "expr" as its first (and only) child.
-
- The following example (a3) is equivalent to a2, but more confusing because
- the two steps above have been combined into a single action statement:
-
- rule! : ID COLON expr
- <<#0=#(#[STMT_WITH_LABEL,$1],#3);>> /* a3 */
- Page 29
-
- (Item 91)
- ##. If you construct temporary ASTs in the middle of the recognition of a
- rule, remember to deallocate the structure should the rule fail. The code
- for failure goes after the ";" and before the next rule. For this reason
- it is sometimes desirable to defer some processing until the rule is
- recognized, rather than at the most appropriate place. For C++ users this
- might be implemented as part of the class destructor.
-
- If the temporary is an AST returned by a called rule then you'll probably
- have to call zzfree_ast() to release the entire AST tree. Consider
- the following example:
-
- obj_name! /* a1 */
- : <<AST *node=0;>> /* a2 */
- class_name <<node=#1;>> /* a3 */
- ( /* a4 */
- () /* empty */ /* a5 */
- <<#0=node;node=0;>> /* a6 */
- | COLON_COLON follows_dot_class[node] /* a7 */
- <<#0=#2;node=0;>> /* a8 */
- ) /* a9 */
- ......... /* a10 */
- /* a11 */
- ; <<if (node!=0) zzfree_ast(node);>> /* a12 */
-
- In this case "class_name" may return a full AST tree (not a trivial tree)
- because of information required to represent template classes (e.g.
- dictionary<int,1000> is a "class_name"). This tree ("node") is passed to
- another rule ("follows_dot_class") which uses it to construct another AST
- tree which incorporates it. If "follows_dot_class" succeeds then node is
- set to 0 (lines a6 or a8) because the tree is now referenced via #2. If
- "follows_dot_class" fails then the entire tree created by class_name must
- be deallocated (line a12). The temporary "node" must be used because there
- is no convenient way (such as #1.1) to refer to class_name from within the
- sub-rule.
-
- Please note the use of an empty sub-rule ("()" on line a5) to avoid the nasty
- init-action problem mentioned earlier.
- (Item 92)
- ##. Example 6 shows debugging code to help locate ASTs that were created
- but never deleted.
- (Item 93)
- ##. If you want to place prototypes for routines that have an AST
- as an argument in the #header directive you should explicitly
- #include "ast.h" after the #define AST_FIELDS and before any references
- to AST:
-
- #define AST_FIELDS int token;char *text;
- #include "ast.h"
- #define zzcr_ast(ast,attr,tok,astText) \
-          create_ast(ast,attr,tok,astText)
- void create_ast (AST *ast,Attr *attr,int tok,char *text);
- Page 30
-
- (Item 94)
- ##. The make-a-root operator for ASTs ("^") can be applied only to
- terminals. (This includes items identified in #token ,#tokclass, and
- #tokdef statements). I think this is because a child rule might return a
- tree rather than a single AST. If it did then it could not be made into a
- root as it is already a root and the corresponding fields of the structure
- are already in use. To make an AST returned by a called rule a root use
- the expression: #(root-rule sibling1 sibling2 sibling3).
-
-    add : expr ("\+"^ expr)* ;           // Is ok
-
-    addOperator : expr (addOp^ expr)* ;  // Is NOT ok - "^" on a rule
-    addOp : "\+" | "\-" ;
-
- Example 2 describes a workaround for this restriction.
- (Item 95)
- ##. Because it is not possible to use an already constructed AST tree
- as the root of a new tree (unless it's a trivial tree with no children)
- one should be suspicious of any constructs like the following:
-
- rule! : ........ <<#0=#(#1,...)...;>>
- ** <=====================
-
- If #1 is a non-trivial tree its existing children will be lost when the
- new tree is constructed for assignment to #0.
- (Item 96)
- ##. Do not assign to #0 of a rule unless automatic construction of ASTs
- has been disabled using the "!" operator:
-
- a! : x y z <<#0=#(#1,#2,#3);>> // ok
- a : x y z <<#0=#(#1,#2,#3);>> // NOT ok
-
- The reason for the restriction is that assignment to #0 will cause any
- ASTs pointed to by #0 to be lost when the pointer is overwritten.
-
- The stated restriction is somewhat stronger than necessary. You can
- assign to #0 even when using automated AST construction, if the old
- tree pointed to by #0 is part of the new tree constructed by #(...).
- For example:
-
- #token COMMA ","
- #token STMT_LIST
-
- stmt_list: stmt (COMMA stmt)* <<#0=#(#[STMT_LIST],#0);>>
-
- The automatically constructed tree pointed to by #0 is just put at the
- end of the new list, so nothing is lost.
-
- If you reassign to #0 in the middle of the rule, automatic tree
- construction will result in the addition of remaining elements at the end
- of the new tree. This is not recommended by TJP.
-
- Special care must be used when combining the make-a-root operator
- (e.g. rule: expr OP^ expr) with this transgression (assignment to #0 when
- automatic tree construction is selected).
- Page 31
-
- (Item 97)
- ##. Even when automatic construction of ASTs is turned off in a rule the
- called rules still return the ASTs that they constructed. The same applies
- when the "!" operator is applied to a called rule. This is hard to
- believe when one sees a rule like the following:
-
- rule: a! b! c!
-
- generates (in part) a sequence of operations like:
-
- _ast = NULL; a(&_ast);
- _ast = NULL; b(&_ast);
- _ast = NULL; c(&_ast);
-
- It appears that the AST pointer is being assigned to a temporary where it
- becomes inaccessible. This is not the case at all. The called rule is
- responsible for placing a pointer to the AST which is constructed onto a
- stack of AST pointers. The stack of AST pointers is normally in global
- scope with ZZAST_STACKSIZE elements.
-
- (The "!" operator simply inhibits the automatic construction of the
- AST trees. It does not prevent the construction of the ASTs themselves.
- When calling a rule which constructs ASTs and not using the result one
- must destroy the constructed AST using zzfree_ast() in order to avoid a
- memory leak. See Example 6 below for code which aids in tracking lost
- ASTs).
-
- Consider the following examples (using the list notation of page 45 of
- the 1.00 manual):
-
- a: A;
- b: B;
- c: C;
-
- #token T_abc_node
-
- rule : a b c ; <<;>> /* AST list (0 A B C) without root */
- rule ! : a b c <<#0=#(0,#1,#2,#3);>> /* AST list (0 A B C) without root */
- rule : a! b! c! <<#0=#(0,#1,#2,#3);>> /* AST list (0 A B C) without root */
- rule : a^ b c /* AST tree (A B C) with root A */
- rule ! : a b c <<#0=#(#1,#2,#3);>> /* AST tree (A B C) with root A */
-
- rule ! : a b c <<#0=#(#[T_abc_node,0],#1,#2,#3);>>
- /* AST tree (T_abc_node A B C) */
- /* with root T_abc_node */
- rule : a b c <<#0=#(#[T_abc_node,0],#0);>> /* the same as above */
- rule : a! b! c! <<#0=#(#[T_abc_node,0],#1,#2,#3);>> /* the same as above */
-
- rule ! : a b c <<#0=#(toAST(T_abc_node),#1,#2,#3);>> /* the same as above */
- rule : a b c <<#0=#(toAST(T_abc_node),#0);>> /* the same as above */
- rule : a! b! c! <<#0=#(toAST(T_abc_node),#1,#2,#3);>> /* the same as above */
-
- The routine "toAST()" calls zzmk_ast() to construct an AST given the token
- number. For a typical version of zzmk_ast() it would look something like the
- following:
-
- AST * toAST (int tokenID) {
- return zzmk_ast (zzastnew(),tokenID,NULL);
- }
-
- Page 32
- I find toAST() more convenient than passing the extra arguments to zzmk_ast()
- using a construct like #[T_abc_node,0] or writing zzmk_ast() with varargs.
- Using varargs defeats most forms of inter-procedural type checking (unless you
- are using C++ which allows overloaded function names).
- (Item 98)
- ##. There is an idiom which can be useful when combining automatic AST
- construction with optional clauses in a grammar. Suppose one wants to
- make the following transformation:
-
- rule : lhs => #(toAST(T_simple),#1)
- rule : lhs rhs => #(toAST(T_complex),#1,#2)
-
- Both lhs and rhs considered separately may be suitable for automatic
- construction of ASTs, but the change in the label from "simple" to "complex"
- appears to require manual tree construction. Use the following idiom:
-
- rule : lhs (
- () <<#0=#(toAST(T_simple),#0);>>
-                 | rhs  <<#0=#(toAST(T_complex),#0);>>
- )
- (Item 99)
- ##. If you use ASTs you have to pass a root AST to ANTLR.
-
- AST *root=NULL;
- again:
- ANTLR (start(&root),stdin);
- walk_the_tree(root);
- zzfree_ast(root);
- root=NULL;
- goto again;
- (Item 100)
- ##. zzfree_ast(AST *tree) will recursively descend the AST tree and free
- all sub-trees. The user should supply a routine zzd_ast() to free any
- resources used by a single node - such as pointers to character strings
- allocated on the heap. See Example 2 on associativity and precedence.
- (Item 101)
- ##. AST elements in rules are assigned numbers in the same fashion as
- attributes with three exceptions:
-
- 1. A hole is left in the sequence when sub-rules are encountered.
- (e.g. "(...)+", "(...)*", and "{...}").
- 2. #0 is the AST of the named rule, not the sub-rule - see the next item
- 3. There is nothing analogous to $i.j notation (which allows one
- to refer to attributes from earlier in the rule). In other words,
- you can't use #i.j notation to refer to an AST created earlier
- in the rule.
-
- ========================================================
- Version 1.30 of Antlr allows one to use symbolic tags
- rather than numbers to refer to matched elements of a rule.
- They are similar in appearance to Sorcerer.
- See the version 1.3 release notes for more information
- ========================================================
-
- Consider the following example:
-
- a : b // B is #1 for the rule
- (c d)* // C is #1 when scope is inside the sub-rule
- // D is #2 when scope is inside the sub-rule
- // You may *NOT* refer to b as #1.1
- e // E is #3 for the rule
- // There is NO #2 for the rule
- Page 33
-
- (Item 102)
- ##. The expression #0 refers to the AST of the named rule. Thus it is
- a misnomer and (for consistency) should probably have been named ## or #$.
- There is nothing equivalent to $0 for ASTs. This is probably because
- sub-rules aren't assigned AST numbers in a rule.
- (Item 103)
- ##. Associativity and precedence of operations is determined by nesting
- of rules. In the example below "=" associates to the right and has the
- lowest precedence. Operators "+" and "*" associate to the left with "*"
- having the highest precedence.
-
- expr0 : expr1 {"=" expr0};
- expr1 : expr2 ("\+" expr2)*;
- expr2 : expr3 ("\*" expr3)*;
- expr3 : ID;
-
- In Example 2 the zzpre_ast() routine is used to walk all the AST nodes.
- The AST nodes are numbered during creation so that one can see the order in
- which they are created and the order in which they are deleted. Do not
- confuse the "#" in the sample output with the AST numbers used to refer to
- elements of a rule in the action part of the rule. The "#" marks in the
- sample output are just there to make it simpler to match elements of the
- expression tree with the order in which zzd_ast() is called for each node in
- the tree.
- (Item 104)
- ##. If the make-a-root operator were NOT used in the rules:
-
- ;expr0 : expr1 {"=" expr0}
- ;expr1 : expr2 ("\+" expr2)*
- ;expr2 : expr3 ("\*" expr3)*
- ;expr3 : ID
-
- With input:
-
- a+b*c
-
- The output would be:
-
- a <#1> \+ <#2> b <#3> \* <#4> c <#5> NEWLINE <#6>
-
- zzd_ast called for <node #6>
- zzd_ast called for <node #5>
- zzd_ast called for <node #4>
- zzd_ast called for <node #3>
- zzd_ast called for <node #2>
- zzd_ast called for <node #1>
- Page 34
-
- (Item 105)
- ##. Suppose that one wanted to replace the terminal "+" with the rule:
-
- addOp : "\+" | "\-" ;
-
- Then one would be unable to use the "make-a-root" operator because it can
- be applied only to terminals.
-
- There are two workarounds. The #tokclass feature allows one to write:
-
- #tokclass AddOp { "\+" "\-"}
-
- A #tokclass identifier may be used in a rule wherever a simple #token
- identifier may be used.
-
- The other workaround is much more complicated:
-
- expr : (expr0 NEWLINE)
- ;expr0 : expr1 {"="^ expr0}
- ;expr1! : expr2 <<#0=#1;>>
- (addOp expr2 <<#0=#(#1,#0,#2);>> )*
- ;expr2 : expr3 ("\*"^ expr3)*
- ;expr3 : ID
- ;addOp : "\+" | "\-"
-
- With input:
-
- a-b-c
-
- The output is:
-
- ( \- <#4> ( \- <#2> a <#1> b <#3> ) c <#5> ) NEWLINE <#6>
-
- The "!" for rule "expr1" disables automatic construction of ASTs in the
- rule. This allows one to manipulate #0 manually. If the expression had
- no addition operator then the sub-rule "(addOp expr)*" would not be
- executed and #0 would be assigned the AST constructed by rule expr2 (i.e.
- AST #1). However, if there is an addOp present, then each time the sub-rule
- is rescanned due to the "(...)*" the current tree in #0 is placed as the
- first of two siblings underneath a new tree. This new tree has the AST
- returned by addOp (AST #1 of the addOp sub-rule) as the root.
- (Item 106)
- ##. There is an option for doubly linked ASTs in the module ast.c. It is
- controlled by #define zzAST_DOUBLE. Even with zzAST_DOUBLE only the right
- and down fields are filled while the AST tree is constructed. Once the tree
- is constructed the user must call the routine zzdouble_link(tree,NULL,NULL) to
- traverse the tree and fill in the left and up fields. See page 12 of the
- 1.06 manual for more information.
- (Item 107)
- ##. If a rule which creates an AST is called and the result is not
- linked into the tree being constructed then zzd_ast() will not be called
- to release the resources used by the rule. Prior to version 1.20
- this was especially important when rules were used in syntactic predicates.
- Versions >= 1.20 bypass construction of all ASTs during guess mode.
-
- Page 35
- ===============================================================================
- Section on Semantic Predicates
- -------------------------------------------------------------------------------
- (Item 108)
- ##. There is a bug in 1.1x and 1.2x which prevents semantic predicates
- from including string literals. The predicate is incorrectly
- "string-ized" in the call to zzfailed_predicate.
-
- rule: <<containsCharacter("!@#$%^&*",LATEXT(1))>>? ID
- /* Will not work */
-
- The workaround is to place the literal in a string constant and use
- the variable name.
- (Item 109)
- ##. There is a bug in 1.1x and 1.2x which prevents semantic predicates from
- crossing lines unless one uses an escaped newline.
-
- rule: <<do_test();\ /*** Note escaped newline ***/
- this_works_in_120()>>? x y z;
- (Item 110)
- ##. Semantic predicates are enclosed in "<<... >>?" but because they are
- inside "if" statements they normally do not end with a ";" - unlike other
- code enclosed in "<<...>>" in ANTLR.
- (Item 111)
- ##. If one leaves an extra space after the close of the action:
-
- <<...>> ? instead of <<...>>?
-
- then ANTLR won't recognize it as a semantic predicate.
- (Item 112)
- ##. Init-actions are ignored as far as the hoisting of semantic predicates
- is concerned.
- Page 36
-
- (Item 113)
- ##. Semantic predicates which are not the first element in the rule or
- sub-rule become "validation predicates" and are not used for prediction.
- After all, if there are no alternatives, then there is no need for
- prediction - and alternatives exist only at the left edge of rules
- and sub-rules. Even if a semantic predicate is on the left edge there
- is no guarantee that it will be part of the prediction expression.
- Consider the following two examples:
-
- a : << LA(1)==ID ? propX(LATEXT(1)) : 1 >>? ID glob /* a1 */
- | ID glob /* a2 */
- ;
- b : << LA(1)==ID ? propX(LATEXT(1)) : 1 >>? ID glob /* b1 */
- | NUMBER glob /* b2 */
- ;
-
- Rule a requires the semantic predicate to disambiguate alternatives
- a1 and a2 because the rules are otherwise identical. Rule b has a
- token type of NUMBER in alternative b2 so it can be distinguished from
- b1 without evaluation of the semantic predicate during prediction. In
- both cases the semantic predicate will also be evaluated inside the rule.
-
- When the tokens which can follow a rule allow ANTLR to disambiguate the
- expression without resorting to semantic predicates, ANTLR may not evaluate
- the semantic predicate in the prediction code. For example:
-
- simple_func : <<LA(1)==ID ? isSimpleFunc(LATEXT(1)) : 1>>? ID
- complex_func : <<LA(1)==ID ? isComplexFunc(LATEXT(1)) : 1>>? ID
-
- function_call : "(" ")"
-
- func : simple_func function_call
- | complex_func "." ID function_call
-
- In this case, a "simple_func" MUST be followed by a "(", and a
- "complex_func" MUST be followed by a ".", so it is unnecessary to evaluate
- the semantic predicates in order to predict which of the alternatives to
- use. A simple test of the lookahead tokens is sufficient. As stated
- before, the semantic predicates will still be used to validate the rule.
- Page 37
-
- (Item 114)
- ##. Suppose that the requirement that all semantic predicates which are
- used in prediction expressions must appear at the left hand edge of a rule
- were lifted. Consider the following code segment:
-
- cast_expr /* a1 */
- : LP typedef RP cast_expr /* a2 */
- | expr13 /* a3 */
- ;expr13 /* a4 */
- : id_name /* a5 */
- | LP cast_expr RP /* a6 */
- ;typedef /* a7 */
- : <<LA(1)==ID ? isTypedefName(LATEXT(1)) : 1 >>? ID /* a8 */
- ;id_name /* a9 */
- : ID /* a10 */
-
- Now consider the token sequences:
-
- Token: #1 #2 #3 #4
- -- ----------------------- -- --
- "(" ID-which-is-typedef ")" ID
- "(" ID-which-is-NOT-typedef ")"
-
- Were the semantic predicate at line a8 hoisted to predict which alternative
- of cast_expr to use (a2 or a3) the program would use the wrong lookahead
- token (LA(1) and LATEXT(1)) rather than LA(2) and LATEXT(2) to check for an
- ID which satisfies "isTypedefName()". This is because it is preceded by a
- "(". This problem could perhaps be solved by application of sufficient
- ingenuity; in the meantime, the solution is to rewrite the rules
- so as to move the decision point to the left edge of the production.
-
- First perform in-line expansion of expr13 (line a3) in cast_expr:
-
- cast_expr /* b1 */
- : LP typedef RP cast_expr /* b2 */
- | id_name /* b3 */
- | LP cast_expr RP /* b4 */
-
- Secondly, move the alternatives (in cast_expr) beginning with LP to a
- separate rule so that "typedef" and "cast_expr" will be on the left edge:
-
- cast_expr /* c1 */
- : LP cast_expr_suffix /* c2 */
- | id_name /* c3 */
- ;cast_expr_suffix /* c4 */
- : typedef RP cast_expr /* c5 */
- | cast_expr RP /* c6 */
- ;typedef /* c7 */
- : <<LA(1)==ID ? isTypedefName(LATEXT(1)) : 1 >>? ID /* c8 */
- ;id_name /* c9 */
- : ID /* c10 */
-
- This will result in the desired treatment of the semantic predicate to
- choose from alternatives c5 and c6.
- Page 38
-
- (Item 115)
- ##. Validation predicates are evaluated by the parser. If they fail a
- call to zzfailed_predicate(string) is made. To disable the message
- redefine the macro zzfailed_predicate(string) or use the optional
- "failed predicate" action which is enclosed in "[" and "]" and follows
- immediately after the predicate:
-
- a : <<LA(1)==ID ?
- isTypedef(LATEXT(1)) : 1>>?[printf("Not a typedef\n");]
-
- Douglas Cuthbertson (Douglas_Cuthbertson.JTIDS@jtids_qmail.hanscom.af.mil)
- has pointed out that Antlr fails to put the fail action inside "{...}"
- which can lead to problems when the action contains multiple statements.
- (Item 116)
- ##. An expression in a semantic predicate (e.g. <<isFunc()>>? ) should not
- have side-effects. If there is no match then the rest of the rule using the
- semantic predicate won't be executed.
- Page 39
-
- (Item 117)
- ##. What is the "context" of a semantic predicate? Answer due to TJP:
-
- The context of a predicate is the set of k-strings (comprised of lookahead
- symbols) that can be matched following the execution of a predicate. For
- example,
-
- a : <<p>>? alpha ;
-
- The context of "p" is LOOK(alpha) where LOOK(alpha) is the set of
- lookahead k-strings for alpha.
-
- Normally, one should compute the context for ANTLR (manually) because
- ANTLR is not smart enough to know the nature of your predicate and does not
- know how much context information is needed; it's conservative and tries
- to compute full LL(k) lookahead. Normally, you only need one token:
-
- class_name: <<isClass(LATEXT(1))>>? ID ;
-
- This example is incomplete, the predicate should really be:
-
- class_name: <<LA(1)==ID ? isClass(LATEXT(1)) : 1>>? ID ;
-
- This says, "I can tell you something if you have an ID, otherwise
- just assume that the rule is semantically valid." This only makes a
- difference if the predicate is *hoisted* out of the rule. Here is an
- example that won't work because it doesn't have a context check in the
- predicates:
-
- a : ( class_name | NUM )
- | type_name
- ;
-
- class_name : <<isClass(LATEXT(1))>>? ID ;
-
- type_name : <<isType(LATEXT(1))>>? ID ;
-
- The prediction for production one of rule "a" will be:
-
- if ( LA(1) in { ID, NUM } && isClass(LATEXT(1)) ) { ...
-
- Clearly, NUM will never satisfy isClass(), so the production will never
- match.
-
- When you ask ANTLR to compute context, it can check for missing predicates.
- With -prc on, for this grammar:
-
- a : b
- | <<isVar(LATEXT(1))>>? ID
- | <<isPositive(LATEXT(1))>>? NUM
- ;
-
- b : <<isType(LATEXT(1))>>? ID
- | NUM
- ;
-
- ANTLR reports:
-
- warning alt 1 of rule itself has no predicate to resolve
- ambiguity upon \{ NUM \}
- Page 40
-
- (Item 118)
- ##. A documented restriction of ANTLR is the inability to hoist multiple
- semantic predicates. However, no error message is given when one attempts
- this. When compiled with k=1 and ck=2 this generates inappropriate code
- in "statement" when attempting to predict "expr":
-
- #header <<
-
- #include "charbuf.h"
-
- int istypedefName (char *);
- int isCommand (char *);
-
- >>
-
- #token BARK
- #token GROWL
- #token ID
-
- statement
- : expr
- | declaration
- ;expr
- : commandName BARK
- | typedefName GROWL
- ;declaration
- : typedefName BARK
- ;typedefName
- : <<LA(1)==ID ? istypedefName(LATEXT(1)) : 1>>? ID
- ;commandName
- : <<LA(1)==ID ? isCommand(LATEXT(1)) : 1>>? ID
- ;
-
- The generated code resembles the following:
-
- void statement()
- {
- if ( (LA(1)==ID) &&
- (LA(2)==BARK || LA(2)==GROWL) &&
- ( (LA(1)==ID ? isCommand(LATEXT(1)) : 1) ||
- (LA(1)==ID ? istypedefName(LATEXT(1)) : 1)) ) {
- expr();
- } else {
- if ( (LA(1)==ID) &&
- (LA(2)==BARK) &&
-                      (LA(1)==ID ? istypedefName(LATEXT(1)) : 1) ) {
- declaration();
- } ...
-
- The problem is that "<typedefName> BARK" will be passed to expr() rather
- than declaration().
-
- Some help is obtained by using leading actions to inhibit hoisting as
- described in the next item. (Don't confuse leading actions with
- init-actions.) However, omitting all semantic predicates in the prediction
- expression doesn't help if one requires them to predict the rule.
- Page 41
-
- (Item 119)
- ##. Leading actions will inhibit the hoisting of semantic predicates into
- the prediction of rules.
-
- expr_rhs
- : <<;>> <<>> expr0
- | command
-
- See the section about known bugs for a more complete example.
- (Item 120)
- ##. When using semantic predicates in ANTLR it is *IMPORTANT* to
- understand what the "-prc on" ("predicate context computation")
- option does and what "-prc off" doesn't do. Consider the following
- example:
-
- +------------------------------------------------------+
- | Note: All examples in this sub-section are based on |
- | code generated with -k=1 and -ck=1. |
- +------------------------------------------------------+
-
- expr : upper
- | lower
- | number
- ;
-
- upper : <<isU(LATEXT(1))>>? ID ;
- lower : <<isL(LATEXT(1))>>? ID ;
- number : NUMBER ;
-
- With "-prc on" ("-prc off" is the default) the code for expr() to predict
- upper() would resemble:
-
- if (LA(1)==ID && isU(LATEXT(1)) && LA(1)==ID) { /* a1 */
- upper(zzSTR); /* a2 */
- } /* a3 */
- else { /* a4 */
- if (LA(1)==ID && isL(LATEXT(1)) && LA(1)==ID) { /* a5 */
- lower(zzSTR); /* a6 */
- } /* a7 */
- else { /* a8 */
- if (LA(1)==NUMBER) { /* a9 */
- zzmatch(NUMBER); /* a10 */
- } /* a11 */
- else /* a12 */
- {zzFAIL();goto fail;} /* a13 */
- } /* a14 */
- } ...
- ...
-
- *******************************************************
- *** ***
- *** Starting with version 1.20: ***
- *** Predicate tests appear AFTER lookahead tests ***
- *** ***
- *******************************************************
-
- Note that each test of LATEXT(i) is guarded by a test of the token type
- (e.g. "LA(1)==ID && isU(LATEXT(1))").
-
- Page 42
- With "-prc off" the code would resemble:
-
- if (isU(LATEXT(1)) && LA(1)==ID) { /* b1 */
- upper(zzSTR); /* b2 */
- } /* b3 */
- else { /* b4 */
- if (isL(LATEXT(1)) && LA(1)==ID) { /* b5 */
- lower(zzSTR); /* b6 */
- } /* b7 */
- else { /* b8 */
- if ( (LA(1)==NUMBER) ) { /* b9 */
- zzmatch(NUMBER); /* b10 */
- } /* b11 */
- else /* b12 */
- {zzFAIL();goto fail;} /* b13 */
- } /* b14 */
- } ...
- ...
-
- Thus when coding the grammar for use with "-prc off" it is necessary
- to do something like:
-
- upper : <<LA(1)==ID && isU(LATEXT(1))>>? ID ;
- lower : <<LA(1)==ID && isL(LATEXT(1))>>? ID ;
-
- This will make sure that if the token is of type NUMBER it is not
- passed to isU() or isL() when using "-prc off".
-
- So, you say to yourself, "-prc on" is good and "-prc off" is bad. Wrong.
-
- Consider the following slightly more complicated example in which the
- first alternative of rule "expr" contains tokens of two different types:
-
- expr : ( upper | NUMBER ) NUMBER
- | lower
- | ID
- ;
-
- upper : <<LA(1)==ID && isU(LATEXT(1))>>? ID ;
- lower : <<LA(1)==ID && isL(LATEXT(1))>>? ID ;
- number : NUMBER ;
-
- With "-prc off" the code would resemble:
-
- ...
- { /* c1 */
- if (LA(1)==ID && isU(LATEXT(1)) && /* c2 */
- ( LA(1)==ID || LA(1)==NUMBER) ) { /* c3 */
- { /* c4 */
- if (LA(1)==ID) { /* c5 */
- upper(zzSTR); /* c6 */
- } /* c7 */
- else { /* c8 */
- if (LA(1)==NUMBER) { /* c9 */
- zzmatch(NUMBER); /* c10 */
- } /* c11 */
- else {zzFAIL();goto fail;}/* c12 */
- } /* c13 */
- } ...
- ...
-
- Page 43
- Note that if the token is a NUMBER (i.e. LA(1)==NUMBER) then the clause at
- line c2 ("LA(1)==ID && ...") will always be false, which implies that the
- test in the "if" statement (lines c2/c3) will always be false. (In other
- words LA(1)==NUMBER implies LA(1)!=ID). Thus the sub-rule for NUMBER at
- line c9 can never be reached.
-
- With "-prc on" essentially the same code is generated, although it
- is not necessary to manually code a test for token type ID preceding
- the call to "isU()".
-
- The workaround is to bypass the heart of the predicate when
- testing the wrong type of token.
-
- upper : <<LA(1)==ID ? isU(LATEXT(1)) : 1>>? ID ;
- lower : <<LA(1)==ID ? isL(LATEXT(1)) : 1>>? ID ;
-
- Then with "-prc off" the code would resemble:
- ...
- { /* d1 */
- if ( (LA(1)==ID ? isU(LATEXT(1)) : 1) && /* d2 */
- (LA(1)==ID || LA(1)==NUMBER) ) { /* d3 */
- ...
- ...
-
- With this correction the body of the "if" statement is now reachable
- even if the token type is NUMBER - the "if" statement does what one
- wants.
-
- With "-prc on" the code would resemble:
-
- ... /* e1 */
- if (LA(1)==ID && /* e2 */
- (LA(1)==ID ? isU(LATEXT(1)) : 1) && /* e3 */
- (LA(1)==ID || LA(1)==NUMBER) ) { /* e4 */
- ...
- ...
-
- Note that the problem of the unreachable "if" statement body has
- reappeared because of the redundant test ("e2") added by the predicate
- computation.
-
- The lesson seems to be: when using rules whose alternatives are
- "visible" to ANTLR (within the lookahead distance) and carry different
- token types, it is probably dangerous to use "-prc on".
- Page 44
-
- (Item 121)
- ##. You cannot use downward inheritance to pass parameters
- to semantic predicates which are NOT validation predicates. The
- problem appears when the semantic predicate is hoisted into a
- parent rule to predict which rule to call:
-
- For instance:
-
- a : b1 [flag]
- | b2
- | b3
-
- b1 [int flag]
- : <<LA(1)==ID && flag && hasPropertyABC (LATEXT(1))>>? ID ;
-
- b2
- : <<LA(1)==ID && hasPropertyXYZ (LATEXT(1))>>? ID ;
-
- b3 : ID ;
-
- When the semantic predicate is evaluated within rule "a" to determine
- whether to call b1, b2, or b3 the compiler will discover that there
- is no variable named "flag" for procedure "a()". If you are unlucky
- enough to have a variable named "flag" in a() then you will have a
- VERY difficult-to-find bug.
-
- The -prc option has no effect on this behavior.
-
- It is possible that a leading action (init-actions are ignored for purposes
- of hoisting) will inhibit the hoisting of the predicate and make this code
- work. I have not verified this with versions 1.2x.
- (Item 122)
- ##. Another reason why semantic predicates must not have side effects is
- that when they are hoisted into a parent rule in order to decide which
- rule to call they will be invoked twice: once as part of the prediction
- and a second time as part of the validation of the rule.
-
- Consider the example above of upper and lower. When the input does
- in fact match "upper" the routine isU() will be called twice: once inside
- expr() to help predict which rule to call, and a second time in upper() to
- validate the prediction. If the second test fails the macro zzpred_fail()
- is called.
-
- As far as I can tell, there is no simple way to disable the use of a
- semantic predicate for validation after it has been used for prediction.
- Page 45
-
- (Item 123)
- ##. I had a problem in which I needed to do a limited amount of
- lookahead, but didn't want to use all the machinery of syntactic
- predicates. I found that I could enlarge the set of expressions accepted
- by "expr" and then look at the AST created in order to determined what
- rules could follow:
-
- cast_expr /* a1 */
- : <<int isCast=0;>> /* a2 */
- /* a3 */
- LP! predefined_type RP! cast_expr /* a4 */
- <<#0=#(toAST(T_cast),#0);>> /* a5 */
- | LP! expr0 RP! /* a6 */
- <<if ((#2->token)==T_class_name) { /* a7 */
- isCast=1; /* a8 */
- } else { /* a9 */
- isCast=0; /* a10 */
- }; /* a11 */
- >> /* a12 */
- ( <<;>> <<isCast==1>>? /* a13 */
- <<printf ("\nIs cast expr\n");>> /* a14 */
- cast_expr /* a15 */
- <<#0=#(toAST(T_cast),#0);>> /* a16 */
- /* a17 */
- | <<printf ("\nIs NOT cast expr\n");>> /* a18 */
- () /* empty */ /* a19 */
- ) /* a20 */
- | unary_expr /* a21 */
-
- Later on I gave up on this approach and decided to use syntactic
- predicates anyway. It not only solved this problem, but others
- where it was more difficult to patch up the grammar. I can't bring
- myself to remove the example, though.
-
- Page 46
- ===============================================================================
- Section on Syntactic Predicates (also known as "Guess Mode")
- -------------------------------------------------------------------------------
- (Item 124)
- ##. The terms "infinite lookahead", "guess mode", and "syntactic predicates"
- are all equivalent. Sometimes the term "backtracking" is used as well,
- although "backtracking" can also be used to discuss lexing and DLG.
- The term "syntactic predicate" emphasizes that it is handled by the
- parser. The term "guess mode" emphasizes that the parser may have to
- backtrack. The term "infinite lookahead" emphasizes the implementation in
- ANTLR: the entire input is read, processed, and tokenized by DLG before
- ANTLR begins parsing.
- (Item 125)
- ##. An expression in a syntactic predicate should not have side-effects.
- If there is no match then the rule which uses the syntactic predicate won't be
- executed.
- (Item 126)
- ##. In some extremely unusual cases a user wants side-effects during guess
- mode. In this case one can exploit the fact that Antlr always
- executes init-actions, even when in guess mode:
-
- rule : (guess)? A
- | B
- ;
- guess : <<regular-init-action-that's-always-executed>>
- A ( <<init-action-for-empty-subrule>> ) B
- ;
-
- The init-action in the sub-rule will always be executed, even in guess-mode.
- Contributed by TJP.
- (Item 127)
- ##. When using syntactic predicates the entire input buffer is read and
- tokenized by DLG before parsing by ANTLR begins. If a "wrong" guess
- requires that parsing be rewound to an earlier point, all attributes
- that were created during the "guess" are destroyed; parsing then
- begins again, creating new attributes as it reparses the (previously)
- tokenized input.
- (Item 128)
- ##. In infinite lookahead mode the line and column information is
- hopelessly out-of-sync because zzline will contain the line number of
- the last line of input - the entire input is scanned before
- parsing begins. The line and column information is not restored
- during backtracking. To keep track of the line information in a meaningful
- way one has to use the ZZINF_LINE macro which was added to pccts in version
- 1.20.
-
- Putting line and column information in a field of the attribute will not
- help. The attributes are created by ANTLR, not DLG, and when ANTLR
- backtracks it destroys any attributes that were created in making the
- incorrect guess.
- (Item 129)
- ##. As infinite lookahead mode causes the entire input to be scanned
- by DLG before ANTLR begins parsing, one cannot depend on feedback from
- the parser to the lexer to handle things like providing special token codes
- for items which are in a symbol table (the "lex hack" for typedefs
- in the C language). Instead one MUST use semantic predicates which allow
- for such decisions to be made by the parser.
- (Item 130)
- ##. One cannot use an interactive scanner (ANTLR -gk option) with the
- ANTLR infinite lookahead and backtracking options (syntactic predicates).
- Page 47
-
- (Item 131)
- ##. An example of the need for syntactic predicates is the case where
- relational expressions involving "<" and ">" are enclosed in angle bracket
- pairs.
-
- Relation: a < b
- Array Index: b <i>
- Problem: a < b<i>
- vs. b < a>
-
- I was going to make this into an extended example, but I haven't had
- time yet.
- (Item 132)
- ##. Version 1.20 fixes a problem in 1.10 in which ASTs were constructed
- during guess mode. In version 1.10 care had to be taken to deallocate the
- ASTs that were created in the rules which were invoked in guess mode.
- (Item 133)
- ##. The following is an example of the use of syntactic predicates.
-
- program : ( s SEMI )* ;
-
- s : ( ID EQUALS )? ID EQUALS e
- | e
- ;
-
- e : t ( PLUS t | MINUS t )* ;
-
- t : f ( TIMES f | DIV f )* ;
-
- f : Num
- | ID
- | "\(" e "\)"
- ;
-
- When compiled with k=1:
-
- antlr -fe err.c -fh stdpccts.h -fl parser.dlg -ft tokens.h \
- -fm mode.h -k 1 test.g
-
- One gets the following warning:
-
- warning: alts 1 and 2 of the rule itself ambiguous upon { ID }
-
- even though the manual suggests that this is okay. The only problem is
- that ANTLR 1.10 should NOT issue this error message unless the -w2 option
- is selected.
-
- Included with permission of S. Salters
-
- Page 48
- ===============================================================================
- Section on Inheritance
- -------------------------------------------------------------------------------
- (Item 134)
- ##. A rule which uses upward inheritance:
-
- rule > [int result] : x | y | z;
-
- is simply declaring a function which returns an "int" as its function
- value. If the function has more than one item passed via upward
- inheritance then ANTLR creates a structure to hold the result and
- then copies each component of the structure to the upward inheritance
- variables.
- (Item 135)
- ##. When writing a rule that uses downward inheritance:
-
- rule [int *x] : r1 r2 r3
-
- one should remember that the arguments passed via downward inheritance are
- simply arguments to a function. If one is using downward inheritance
- syntax to pass results back to the caller (really upward inheritance !)
- then it is necessary to pass the address of the variable which will receive
- the result.
- (Item 136)
- ##. ANTLR is smart enough to combine the declaration for an AST with
- the items declared via downward inheritance when constructing the
- prototype for a function which uses both ASTs and downward inheritance.
-
- Page 49
- ===============================================================================
- Section on LA, LATEXT, NLA, and NLATEXT
- -------------------------------------------------------------------------------
- (Item 137)
- ##. Do not use LA(i) or LATEXT(i) in the action routines of #token
- statements. To refer to the token code (in a #token action) of the token
- just recognized use "NLA". NLA is an lvalue (can appear on the left hand
- side of an assignment statement). To refer to the text just recognized
- use zzlextext (the entire text), NLATEXT. One can also use
- zzbegexpr/zzendexpr which refer to the regular expression just matched.
- The char array pointed to by zzlextext may be larger than the string
- pointed to by zzbegexpr and zzendexpr because it includes substrings
- accumulated through the use of zzmore().
- (Item 138)
- ##. Extra care must be taken in using LA(i) and LATEXT(i) when in
- interactive mode (Antlr switch -gk) because Antlr doesn't guarantee that
- it will fetch lookahead tokens until absolutely necessary. It is somewhat
- safer to refer to lookahead information in semantic predicates, but care
- is still required. I have summarized the output from Example 7:
-
- -----------------------------------------------------------------------
- k=1 k=1 k=3 k=3 k=3
- standard infinite standard interactive infinite
- -----------------------------------------------------------------------
- for a semantic predicate
- ------------------------
- LA(0) Next Next -- -- --
- LA(1) Next Next Next Next Next
- zzlextext Next Next Next -- Next
- ZZINF_LA(0) Next Next
- ZZINF_LA(1) NextNext NextNext
- -----------------
- for a rule action
- -----------------
- LA(0) Prev Prev -- Prev --
- LA(1) Prev Prev Prev Next Prev
- zzlextext Prev Prev Prev -- Prev
- ZZINF_LA(0) Prev Prev
- ZZINF_LA(1) Next Next
- -----------------------------------------------------------------------
-
- The entries "Prev" and "Next" mean that the left hand item refers to the
- token which precedes (or follows) the action which generated the output.
-
- For semantic predicate entries think of the following rule:
-
- rule : <<semantic-predicate>>? Next NextNext;
-
- For rule-action entries think of the following rule:
-
- rule : Prev <<action>> Next NextNext;
- (Item 139)
- ##. Example 7 below gives some diagnostic output for a k=3 grammar compiled
- with "standard" options, interactive options (AFLAGS=-gk), and infinite
- lookahead option (CFLAGS=-DZZINF_LOOK).
- (Item 140)
- ##. Example 8 shows how to modify the lookahead token NLA.
- Page 50
-
- (Item 141)
- ##. I find it helpful to think of lexical processing by DLG as a process
- which fills a pipeline and of Antlr as a process which empties a pipeline.
- (This relationship is exposed in C++ mode because DLG passes an object of
- a certain class to Antlr).
-
- With LL_K=1 the pipeline is only one item deep and is trivial and pretty much
- invisible. It is invisible because one can make a decision in Antlr which
- affects how the very next token is processed. For instance with LL_K=1 it is
- possible to change the DLG mode in an Antlr action with zzmode() and have
- the next token (the one following the one just parsed by Antlr) scanned
- according to the new #lexclass.
-
- With LL_K>1 the pipeline is not invisible. DLG will put a number of tokens
- into the pipeline and Antlr will analyze them in the same order. How many
- tokens are in the pipeline depends on options one has chosen.
-
- Case 1: If one has infinite lookahead mode ("(...)?") (also known as
- syntactic predicates) then the pipeline is as huge as the input stream
- since the entire input is tokenized by DLG before Antlr even begins
- analysis.
-
- Case 2: If you have demand lookahead (interactive mode) then you'll have a
- varying amount of lookahead depending on how much Antlr thinks it needs to
- parse the thing it is working on. This may be zero (or maybe it's 1 token)
- up to k tokens. Naturally it takes extra work by Antlr to keep track of
- how many tokens are in the pipe and how many are needed to parse the next
- rule.
-
- Case 3: In "normal" mode DLG tries to stay exactly k tokens ahead of
- Antlr. This is a half-truth. It rounds k up to the next power of
- 2 so that with k=3 it actually has a pipeline of 4 tokens. If one says
- "k=3" the analysis is still k=3, but the pipeline size is rounded up
- because TJP decided it was better to use a bit-wise "and" than some other
- mechanism to compute (n+1) mod size - where n is the position in a circular
- buffer and size is the rounded-up pipeline size.
-
- Page 51
- ===============================================================================
- Section on Prototypes
- -------------------------------------------------------------------------------
- (Item 142)
- ##. Prototype for typical create_attr routine:
-
- #define zzcr_attr(attr,token,text) \
- create_attr(attr,token,text)
-
- void create_attr (Attrib *attr,int token,char *text);
- (Item 143)
- ##. Prototype for a typical create_ast routine invoked to automatically
- construct an AST from an attribute:
-
- #define zzcr_ast(ast,attr,tok,text) \
- create_ast(ast,attr,tok,text)
-
- void create_ast (AST *ast,Attrib *attr,int tok,char *text);
- (Item 144)
- ##. Prototype for a typical make_ast routine invoked by the #[...]
- notation.
-
- AST *zzmk_ast (AST *ast,int token,char *text);
- (Item 145)
- ##. Prototype for a typical zzd_ast macro which is invoked when destroying
- an AST node:
-
- #define zzd_ast(node) delete_ast(node)
-
- void delete_ast (AST * node);
- (Item 146)
- ##. Prototype for zzdef0 macro to initialize $0 of a rule:
-
- #define zzdef0(attr) define_attr_0 (attr)
-
- void define_attr_0 (Attrib *attr);
- (Item 147)
- ##. Prototype for ANTLR (these are actually macros):
-
- read from file: void ANTLR (void startRule(...),FILE *)
- read from string: void ANTLRs (void startRule(...),zzchar_t *)
- read from function: void ANTLRf (void startRule(...),int (*)())
- read from file: void ANTLRm
- (void startRule(...),FILE *,int lexclass)
-
- In the call to ANTLRf the function behaves like getchar()
- in that it returns EOF (-1) to indicate end-of-file.
-
- If ASTs are used or there is downward or upward inheritance then the
- call to the startRule must pass these arguments:
-
- AST *root;
- ANTLR (startRule(&root),stdin);
-
- Page 52
- ===============================================================================
- Section on ANTLR/DLG Internals and Routines That Might Be Useful
- -------------------------------------------------------------------------------
- ****************************
- ****************************
- ** **
- ** Use at your own risk **
- ** **
- ****************************
- ****************************
- (Item 148)
- ##. Sometimes I have wanted to add code which appears before every
- #token action or after every #token action. Rather than modify every
- #token statement one could add code to pccts/h/dlgauto.h near line 430:
-
- (*actions)[accepts[state]]();
-
- This statement is executed for every #token statement. Even #token
- statements without a user-written action contain the required action:
-
- NLA=TokenIdentifier
-
- Following the statement near line 430 of dlgauto.h would be an appropriate
- place to insert debug code to print out token definitions. The name
- for token "i" is in the char * array zztokens[i] (defined in antlr.h).
- (Item 149)
- ##. static int zzauto - defined in dlgauto.h
-
- Current DLG mode. This is used by zzmode() only.
- (Item 150)
- ##. void zzerr (char * s) defined in dlgauto.h
-
- Defaults to zzerrstd(char *s) in dlgauto.h
-
- Unless replaced by a user-written error reporting routine:
-
- fprintf(stderr,
- "%s near line %d (text was '%s')\n",
- ((s == NULL) ? "Lexical error" : s),
- zzline,zzlextext);
-
- This should probably be "void zzerr (const char * s)".
- (Item 151)
- ##. static char zzebuf[70] defined in dlgauto.h
-
- Page 53
- ===============================================================================
- Section on Known Minor Bugs in pccts (in reverse chronological order)
- -------------------------------------------------------------------------------
- (Item 152)
- ##. The fail action following a semantic predicate is not enclosed
- in "{...}". This can lead to problems when the fail action contains
- more than one statement. Reported by Douglas Cuthbertson
- (Douglas_Cuthbertson.JTIDS@jtids_qmail.hanscom.af.mil).
- (Item 153)
- ##. UPDATE.120 (1-Apr-94) reports that there are problems in
- combining guess mode and semantic predicates under some circumstances.
-
- Page 54
- ===============================================================================
- Ideas on the Construction of ASTs and their use with Sorcerer
- -------------------------------------------------------------------------------
- Consider the problem of a grammar which would normally require two
- passes through the source code to properly analyze. In some cases
- it is convenient to perform a first pass which creates AST trees
- and perform the second pass by analyzing the AST trees with Sorcerer.
-
- 1) Define an AST node that contains the information you'll need in the
- second pass. For example,
-
- /*
- * Parse trees are represented by an abstract-syntax-tree (AST)
- * (forward declare the pointer here). Refer to parse.h for description
- * of parse_info.
- */
- typedef struct parse_struct *ast_ref;
-
- /* parser attributes ($-symbols) & AST nodes */
- typedef struct parse_struct *pinfo_ref;
-
- /*
- * the parse structure is used to describe both attributes and
- * AST nodes
- */
-
- struct parse_struct {
- pinfo_ref right; /* points to siblings */
- pinfo_ref down; /* points to children */
- int token; /* token number (see tokens.h) */
- char *text; /* input text */
- src_pos pos; /* position in source file */
- object_ref obj; /* object description (id's) */
- type_ref typ; /* type description (expr's) */
- const_value value; /* value of a constant expression */
- } ;
-
- /*
- * define Abstract Syntax Tree (AST) nodes
- */
-
- /* ast_ref was forward-defined */
- typedef struct parse_struct AST;
-
- /*
- * the Pass-1 (parse phase) parse-attributes ($-variables)
- * have the same structure as an AST node.
- */
- typedef struct parse_struct Attrib, *Attrib_ref;
-
-
- In the code above, the parse-attribute was defined to have the same
- structure as an AST node. This isn't a requirement, but just makes it
- easier to pass information produced in the first pass on to subsequent
- passes.
-
- Page 55
- 2) Have the first pass build a symbol table as it parses the input, perform
- semantic checks, and build an AST. Use the -gt (generate tree) option on
- ANTLR, and override the automatically generated tree construction operations
- as needed. For example,
-
- var_declare:
- << pvec_ref v_list;
- int i;
- boolean has_var_section = FALSE;
- >>
- VAR^
- (
- var_id_list > [v_list] COLON
- { extern_kw
- | static_kw
- }
- type
- <<
- for (i = 0; i < v_list->len; ++i) {
- object_ref v = (object_ref) v_list->val[i];
- define_var(v, $4.typ);
- }
- >>
- { ASSIGNMENT expr
- << mark_var_use(#2, VAR_RHS); >>
- }
- SEMI
- << free_pvec(v_list); >>
- )+
- ;
- var_id_list > [pvec_ref v_list]:
- << object_ref this_var;
- $v_list = new_pvec();
- >>
- ID
- << this_var = new_var_id(&$1);
- if (this_var != NULL) append_pvec($v_list, (void *)this_var);
- >>
- (
- COMMA ID
- << this_var = new_var_id(&$2);
- if (this_var != NULL) append_pvec($v_list, (void *)this_var);
- >>
- )*
- ;
-
- The "pvec" stuff above is just a vector of pointers that can be
- extended automatically. A linked list would work just as well. The
- idea is that we must first collect the declared variables, then
- parse the type declaration, then bind the type to the declared
- variables. We used ANTLR's auto-tree-generation mode, and didn't
- override its actions with our own. Therefore, the following Sorcerer
- fragment will recognize the AST built for a variable declaration:
-
- Page 56
- var_declare:
- #( VAR
- ( v_list: var_id_list COLON
- { EXTERN | STATIC }
- type
- { ASSIGNMENT expr }
- SEMI
- )+
- )
- ;
- var_id_list:
- ID ( COMMA ID)*
- ;
-
- Here's an example, where we use explicit rules to build an AST:
-
- expr!:
- simple_expr
- << $expr = $1; #0 = #1; >>
- ( rel_op simple_expr
- << parse_binary_op(&$expr, &$1, &$2); #0 = #(NULL, #0, #1, #2); >>
- )*
- << $expr.token = EXPR;
- $expr.text = "expr";
- #0 = #(#[&$expr], #0);
- >>
- ;
-
- The construct #[&$expr] first takes the address of the $expr
- attribute (attributes are structures, not pointers, in this example),
- and then applies the #[] operation which makes a call to the routine
- that creates an AST node, given an attribute (or attribute address
- in our case). It takes a while to get the hang of where the &'s,
- #'s, and $'s go, but it can be a real time-saver once you master it.
- What we're doing above is building a special EXPR (expression) node.
- This node would be parsed as follows in subsequent passes, using
- Sorcerer:
-
- expr: #( e: EXPR
- l_oprnd: simple_expr << e->typ = l_oprnd->typ; >>
- (op: rel_op r_oprnd: simple_expr
- <<
- e->typ = std_bool_type->obj_type;
- if (op->token == IN) {
- /* no type conversion checking for IN.
- * try to rewrite simple IN ops.
- */
- if (is_simple_in_op(l_oprnd, op, r_oprnd)) {
- rewrite_simple_in_op(l_oprnd, op, r_oprnd);
- }
- } else {
- cvt_term(&l_oprnd, op, r_oprnd, _t);
- }
- >>
- )*
- );
-
- Page 57
- We left in the actual actions of the second (Sorcerer driven) pass.
- Notice how the Sorcerer grammar labels various parts of the expr
- node ("e", "l_oprnd", "op", and "r_oprnd"). This gives the second
- pass access to each node, as it is recognized.
-
- The second pass uses the "typ" field, which contains the type of
- the ID, expression, or literal parsed by the first pass. In the
- actions above, we are propagating additional type information (for
- example, the result of a relational op is always a boolean, checking
- for implicit type conversions, and handling simple cases of Pascal's
- IN operation). The fragment above is from a Pascal to Ada translator,
- so the translator has to make Pascal's implicit type conversions
- between integer and real into explicit Ada type conversions, and
- has to convert operations on sets (i.e. IN) into operations on
- packed boolean arrays, in Ada, or calls to runtime routines.
-
- Sometimes when you are building the AST for a given construct,
- you need to use information gained from semantic analysis. An
- example is the "assignment" or "call" statement:
-
- /*
- * If a variable access appears alone, then it must be either a call to
- * procedure with no parameters, or an indirection through a pointer
- * to a procedure with no parameters.
- */
- assign_or_call_stmt!:
- << type_ref r_type = NULL;
- ast_ref v;
- >>
- variable
- << v = #1;
- $assign_or_call_stmt = $1;
- $assign_or_call_stmt.token = PROC_CALL;
- $assign_or_call_stmt.text = "proc_call";
- r_type = $assign_or_call_stmt.typ;
- if (v != NULL && v->obj != NULL
- && v->obj->obj_result != NULL
- && v->obj->obj_kind == func_obj
- && v->down->token == ID
- && v->down->right == NULL) {
- object_ref func = v->obj;
- /* function name used on left hand side;
- * convert to reference to the function's return value
- */
- v->obj = func->obj_result;
- v->typ = func->obj_result->obj_type;
- v->down->text = func->obj_result->obj_name;
- v->down->obj = func->obj_result;
- v->down->typ = func->obj_result->obj_type;
- }
- #0 = v;
- >>
-
- Page 58
- { ASSIGNMENT expr
- << $assign_or_call_stmt.token = ASSIGNMENT;
- $assign_or_call_stmt.text = ":=";
- mark_var_use(#2, VAR_RHS);
- mark_var_use(v, VAR_LHS);
- #0 = #(NULL, #0, #[&$1], #2);
- >>
- | ( LPAREN actual_param_list RPAREN
- << mark_actual_param_use(#2, r_type);
- mark_var_use(v, VAR_RHS);
- #0 = #(NULL, #0, #[&$1], #2, #[&$3]);
- >>
- )
- }
- << #0 = #( #[&$assign_or_call_stmt], #0); >>
- ;
-
- The problem we're solving is that both an assignment statement and a
- procedure call statement begin with a "variable". Since ANTLR is
- LL-based, this statement construct is "ambiguous" in that both statement
- types (assignment and call) begin with the same non-terminal. A
- "variable" includes such operations as array subscripting, pointer
- dereferencing, and record field selection. Thus, a "variable" may comprise
- an arbitrary number of tokens.
-
- We might use syntactic predicates as a form of look-ahead to resolve the
- two cases above, but instead I decided to make the assumption that we have
- "PROC_CALL", and to correct that "guess" once we see the assignment
- operation. Thus, the above rule will build one of the following two AST
- structures:
-
- assign_stmt:
- #( ASSIGNMENT
- variable ASSIGNMENT expr
- )
- ;
- call_stmt:
- #( PROC_CALL
- variable {LPAREN actual_param_list RPAREN}
- )
- ;
-
- In your AST, you might want to drop unnecessary syntactic tokens such as
- ASSIGNMENT, LPAREN, RPAREN, COMMA, COLON, etc. We kept them, because we
- thought it would be necessary for certain parts of source-to-source
- translation. We don't think that's true any more, but we have not gone back
- and changed the AST structure either.
-
- Page 59
- 3) Build a separate Sorcerer grammar file to recognize the AST that you
- have built, and then add your second pass actions. These actions will
- access fields in the AST node that were filled in by the first pass. For
- example, identifiers will probably have an "object_ref" that points to the
- object named by the identifier, and expression (EXPR) nodes will have a
- "typ" field that gives the expression's type. You might also add a "value"
- field that gives the value of a literal, named literal, or statically
- evaluated constant expression. See the code fragments above for some ideas
- on how this is done.
-
- Conclusions:
-
- 1) You'll need an ANTLR (.g) description for pass1, and a separate
- Sorcerer (.sor) description for pass2. Often the pass2 AST
- representation is much more regular and well-formed than the
- original text token stream used by pass1.
-
- 2) It can be a bit intimidating putting the pieces together.
- Try it incrementally, trying a small subset of your larger
- problem.
-
- 3) There are a lot of ways to go with how you represent
- attributes ($-variables), AST nodes, and the things that
- go on in various passes. For example, you might have
- pass1 simply build the AST and perform *no* symbol definitions
- or semantic checks. Then pass2 might walk the tree and build
- the symbol table and make various checks. Pass2 might also disambiguate
- cases that look syntactically similar, and can only be disambiguated
- using symbol definitions. Then, you could have a pass3 (another
- Sorcerer driven tree-walk) that does the 'real work' of your
- compiler/translator.
-
- Contributed by Gary Funck (gary@intrepid.com)
-
- Page 60
- ===============================================================================
- Example 1 of #lexclass
- ===============================================================================
- Borrowed code
- -------------------------------------------------------------------------------
- /*
- * Various tokens
- */
- #token "[\t\ ]+" << zzskip(); >> /* Ignore whitespace */
- #token "\n" << zzline++; zzskip(); >> /* Count lines */
-
- #token "\"" << zzmode(STRINGS); zzmore(); >>
- #token "'" << zzmode(CHARACTERS); zzmore(); >>
- #token "/\*" << zzmode(COMMENT); zzskip(); >>
- #token "//" << zzmode(CPPCOMMENT); zzskip(); >>
-
- /*
- * C++ String literal handling
- */
- #lexclass STRINGS
- #token STRING "\"" << zzmode(START); >>
- #token "\\\"" << zzmore(); >>
- #token "\\n" << zzreplchar('\n'); zzmore(); >>
- #token "\\r" << zzreplchar('\r'); zzmore(); >>
- #token "\\t" << zzreplchar('\t'); zzmore(); >>
- #token "\\[1-9][0-9]*" << zzreplchar((char)strtol(zzbegexpr+1, NULL, 10));
- zzmore(); >>
- #token "\\0[0-7]*" << zzreplchar((char)strtol(zzbegexpr+1, NULL, 8));
- zzmore(); >>
- #token "\\0x[0-9a-fA-F]*" << zzreplchar((char)strtol(zzbegexpr+1, NULL, 16));
- zzmore(); >>
- #token "\\~[\n\r]" << zzmore(); >>
- #token "[\n\r]" << zzline++; zzmore(); /* Print warning */ >>
- #token "~[\"\n\r\\]+" << zzmore(); >>
-
- /*
- * C++ Character literal handling
- */
- #lexclass CHARACTERS
- #token CHARACTER "'" << zzmode(START); >>
- #token "\\'" << zzmore(); >>
- #token "\\n" << zzreplchar('\n'); zzmore(); >>
- #token "\\r" << zzreplchar('\r'); zzmore(); >>
- #token "\\t" << zzreplchar('\t'); zzmore(); >>
- #token "\\[1-9][0-9]*" << zzreplchar((char)strtol(zzbegexpr+1, NULL, 10));
- zzmore(); >>
- #token "\\0[0-7]*" << zzreplchar((char)strtol(zzbegexpr+1, NULL, 8));
- zzmore(); >>
- #token "\\0x[0-9a-fA-F]*" << zzreplchar((char)strtol(zzbegexpr+1, NULL, 16));
- zzmore(); >>
- #token "\\~[\n\r]" << zzmore(); >>
- #token "[\n\r]" << zzline++; zzmore(); /* Print warning */ >>
- #token "~[\'\n\r\\]" << zzmore(); >>
-
- Page 61
- /*
- * C-style comment handling
- */
- #lexclass COMMENT
- #token "\*/" << zzmode(START); zzskip(); >>
- #token "~[\*]*" << zzskip(); >>
- #token "\*~[/]" << zzskip(); >>
-
- /*
- * C++-style comment handling
- */
- #lexclass CPPCOMMENT
- #token "[\n\r]" << zzmode(START); zzskip(); >>
- #token "~[\n\r]" << zzskip(); >>
-
- #lexclass START
-
- /*
- * Assorted literals
- */
- #token OCT_NUM "0[0-7]*"
- #token L_OCT_NUM "0[0-7]*[Ll]"
- #token INT_NUM "[1-9][0-9]*"
- #token L_INT_NUM "[1-9][0-9]*[Ll]"
- #token HEX_NUM "0[Xx][0-9A-Fa-f]+"
- #token L_HEX_NUM "0[Xx][0-9A-Fa-f]+[Ll]"
- #token FLOAT_NUM "([1-9][0-9]*{.[0-9]*} | {0}.[0-9]+) {[Ee]{[\+\-]}[0-9]+}"
-
- /*
- * Identifiers
- */
- #token Identifier "[_a-zA-Z][_a-zA-Z0-9]*"
-
- Page 62
- ===============================================================================
- Example 2: ASTs
- ===============================================================================
- #header <<
-
- #include "charbuf.h"
- #include <string.h>
-
- int nextSerial;
-
- #define AST_FIELDS int token; int serial; char *text;
- #include "ast.h"
-
- #define zzcr_ast(ast,attr,tok,astText) \
- (ast)->token=tok; \
- (ast)->text=strdup( (char *) &( ( (attr)->text ) ) ); \
- nextSerial++; \
- (ast)->serial=nextSerial;
-
- #define zzd_ast(node) delete_ast(node)
-
- void delete_ast (AST *node);
-
- >>
-
- <<
-
- AST *root=NULL;
-
- void show(AST *tree) {
- if (tree->token==ID) {
- printf (" %s <#%d> ",
- tree->text,tree->serial);}
- else {
- printf (" %s <#%d> ",
- zztokens[tree->token],
- tree->serial);
- };
- }
- void before (AST *tree) {
- printf ("(");
- }
- void after (AST *tree) {
- printf (")");
- }
-
-
- void delete_ast(AST *node) {
- printf ("\nzzd_ast called for <node #%d>\n",node->serial);
- free (node->text);
- return;
- }
-
- Page 63
- int main() {
- nextSerial=0;
- ANTLR (expr(&root),stdin);
- printf ("\n");
- zzpre_ast(root,show,before,after);
- printf ("\n");
- zzfree_ast(root);
- return 0;
- }
- >>
-
- #token WhiteSpace "[\ \t]" <<zzskip();>>
- #token ID "[a-zA-Z]*"
- #token NEWLINE "\n"
- #token OpenAngle "<"
- #token CloseAngle ">"
-
- expr : (expr0 NEWLINE)
-
- ;expr0 : expr1 {"="^ expr0}
- ;expr1 : expr2 ("\+"^ expr2)*
- ;expr2 : expr3 ("\*"^ expr3)*
- ;expr3 : ID
- -------------------------------------------------------------------------------
- Sample output from this program:
-
- a=b=c=d
- ( = <#2> a <#1> ( = <#4> b <#3> ( = <#6> c <#5> d <#7> ))) NEWLINE <#8>
- zzd_ast called for <node #7>
- zzd_ast called for <node #5>
- zzd_ast called for <node #6>
- zzd_ast called for <node #3>
- zzd_ast called for <node #4>
- zzd_ast called for <node #1>
- zzd_ast called for <node #8>
- zzd_ast called for <node #2>
-
- a+b*c
- ( \+ <#2> a <#1> ( \* <#4> b <#3> c <#5> )) NEWLINE <#6>
- zzd_ast called for <node #5>
- zzd_ast called for <node #3>
- zzd_ast called for <node #4>
- zzd_ast called for <node #1>
- zzd_ast called for <node #6>
- zzd_ast called for <node #2>
-
- a*b+c
- ( \+ <#4> ( \* <#2> a <#1> b <#3> ) c <#5> ) NEWLINE <#6>
- zzd_ast called for <node #3>
- zzd_ast called for <node #1>
- zzd_ast called for <node #5>
- zzd_ast called for <node #2>
- zzd_ast called for <node #6>
- zzd_ast called for <node #4>
-
- Page 64
- ===============================================================================
- Example 3: Syntactic Predicates
- ===============================================================================
- Not completed.
- ===============================================================================
- Example 4: DLG input function
- ===============================================================================
- This example demonstrates the use of a DLG input function to work
- around a limitation of DLG. In this example the user wants to
- recognize an exclamation mark as the first character of a line and
- treat it differently from an exclamation mark elsewhere. The
- workaround is for the input function to return a non-printing
- character (binary 1) when it finds an "!" in column 1. If it reads a
- genuine binary 1 in column 1 of the input text it returns a "?".
-
- The parse is started by:
-
- int DLGchar (void);
- ...
- ANTLRf (expr(&root),DLGchar);
- ...
- -------------------------------------------------------------------------------
- #token BANG "!"
- #token BANG_COL1 "\01"
- #token WhiteSpace "[\ \t]" <<zzskip();>>
- #token ID "[a-zA-Z]*"
- #token NEWLINE "\n"
-
- expr! : (bang <<printf ("\nThe ! is NOT in column 1\n");>>
- | bang1 <<printf ("\nThe ! is in column 1\n");>>
- | id <<printf ("\nFirst token is an ID\n");>>
- )* "@"
-
- ;bang! : BANG ID NEWLINE
-
- ;bang1! : BANG_COL1 ID NEWLINE
-
- ;id! : ID NEWLINE
- ;
- -------------------------------------------------------------------------------
-
- Page 65
- #include <stdio.h>
-
- /*
- Antlr DLG input function - See page 18 of pccts 1.00 manual
- */
-
- static int firstTime=1;
-
- static int c;
-
- int DLGchar (void) {
- if (feof(stdin)) {
- return EOF;
- };
- if (firstTime || c=='\n') {
- firstTime=0;
- c=fgetc(stdin);
- if (c==EOF) return (EOF);
- if (c=='!') return ('\001');
- if (c=='\001') return ('?');
- return (c);
- } else {
- c=fgetc(stdin);
- return (c);
- };
- }
-
- Page 66
- ===============================================================================
- Example 5: Maintaining a Stack of DLG Modes
- ===============================================================================
- Contributed by David Seidel
-
- When placed in a #lexaction or a separate file then the modifier "static"
- must be dropped from the declaration of zzauto (line 61) in "dlgauto.h".
-
- These routines have now been incorporated in pccts version 1.30b4. They
- are defined in pccts/h/err.h and are guarded by #ifdef USER_ZZMODE_STACK.
-
- This example will be dropped if they are still part of 1.31 upon its official
- release.
- -------------------------------------------------------------------------------
- #define MAX_MODE ???
- #define ZZMAXSTK (MAX_MODE * 2)
-
- static int zzmstk[ZZMAXSTK] = { -1 };
- static int zzmdep = 0;
- static char msgArea[100];
-
- void
- #ifdef __STDC__
- zzmpush( int m )
- #else
- zzmpush( m )
- int m;
- #endif
- {
- if(zzmdep == ZZMAXSTK - 1)
- { sprintf(msgArea, "Mode stack overflow ");
- zzerr(msgArea);
- }
- else
- { zzmstk[zzmdep++] = zzauto;
- zzmode(m);
- }
- }
-
- void
- zzmpop()
- {
- if(zzmdep == 0)
- { sprintf(msgArea, "Mode stack underflow ");
- zzerr(msgArea);
- }
- else
- { zzmdep--;
- zzmode(zzmstk[zzmdep]);
- }
- }
-
- Page 67
- -------------------------------------------------------------------------------
- A modified version of the above routine which allows the user to pass
- a routine to be executed when the mode is popped from the stack.
-
- When placed in a #lexaction or a separate file then the modifier "static"
- must be dropped from the declaration of zzauto (line 61) in "dlgauto.h".
- -------------------------------------------------------------------------------
- #define ZZMAXSTK ????
-
- static int zzmstk[ZZMAXSTK] = { -1 }; /* stack of DLG modes */
- static void (*zzfuncstk[ZZMAXSTK])(); /* stack of pointer to functions */
- static int zzmdep = 0;
- static char msgArea[100];
-
- void pushMode( int m ,void (*func)())
- {
- if(zzmdep == ZZMAXSTK - 1)
- { sprintf(msgArea, "Mode stack overflow ");
- zzerr(msgArea);
- }
- else
- { zzmstk[zzmdep] = zzauto;
- zzfuncstk[zzmdep] = func;
- zzmdep++;
- zzmode(m);
- }
- }
-
- void popMode()
- {
- void (*thisFunc)();
- if(zzmdep == 0)
- { sprintf(msgArea, "Mode stack underflow ");
- zzerr(msgArea);
- }
- else
- { zzmdep--;
- thisFunc=zzfuncstk[zzmdep];
- zzmode(zzmstk[zzmdep]);
- zzmstk[zzmdep]=0;
- zzfuncstk[zzmdep]=0;
- /* this call might result in indirect recursion of popMode() */
- if (thisFunc!=0) {
- (*thisFunc)();
- };
- }
- }
-
- void resetModeStack() {
- zzmdep=0;
- zzmstk[0]=0;
- zzfuncstk[0]=0;
- }
-
- /* if the lookahead character is a semi-colon then keep on popping */
-
- void popOnSC() {
- if (zzchar==';') popMode();
- }
-
- Page 68
- ===============================================================================
- Example 6: Debug code for create_ast, mk_ast, delete_ast to locate lost ASTs
- ===============================================================================
- This is an example of code which tries to keep track of lost ASTs using
- a doubly linked list of all ASTs maintained by calls from create_ast()
- and mk_ast() to zzastnew_userhook(). When ASTs are deleted by calls
- to zzastdelete_userhook() from the user's AST delete routines they are
- removed from the doubly linked list. Any ASTs left over after zzfree_ast()
- must be considered lost.
-
- This method does not monitor ASTs created by zzdup_ast() because it does
- not call the create_ast() or mk_ast() routines.
- -------------------------------------------------------------------------------
- The #header section must include a definition of AST_FIELDS with the
- equivalent of:
-
- struct _ast *flink, *blink; int serialNumber;
- -------------------------------------------------------------------------------
- int main() {
- ...
- again:
- ...
- reset_ASTlistHead(); /* <======================== */
- ANTLR (sourcecode(&root),stdin);
- treewalk(root);
- zzfree_ast(root);
- root=NULL;
- print_lost_ast(); /* <======================= */
- printf ("\n");
- ...
- goto again;
- ...
- }
- -------------------------------------------------------------------------------
- #ifndef H_ZZNEWAST_USERHOOK
- #define H_ZZNEWAST_USERHOOK
-
- void reset_ASTlistHead(void);
- void zzunhook_tree(AST *tree);
- void zzastnew_userhook(AST *newNode);
- void zzastdelete_userhook (AST *singleNode);
- void print_lost_ast (void);
- void treewalk(AST *tree);
-
- #endif
- -------------------------------------------------------------------------------
- #include "stdpccts.h"
- #include "stdlib.h"
- #include "zzastnew_userhook.h"
-
- static AST ASTlistHead;
- static int ASTserialNumber;
-
- void reset_ASTlistHead(void) {
- while (ASTlistHead.flink!=0 && ASTlistHead.flink!= &ASTlistHead) {
- zzfree_ast(ASTlistHead.flink);
- };
- ASTlistHead.flink=&ASTlistHead;
- ASTlistHead.blink=&ASTlistHead;
- ASTserialNumber=1;
- return;
- }
-
- Page 69
- /* Stop tracking ASTs in a tree without actually deleting them */
-
- void zzunhook_tree (AST * tree) {
-
- while (tree != 0) {
- zzunhook_tree (tree->down);
- zzastdelete_userhook (tree);
- tree=tree->right;
- };
- return;
- }
-
- /* Track new AST */
-
- void zzastnew_userhook(AST *newNode) {
-
- AST *prev;
-
- prev=ASTlistHead.blink;
- prev->flink=newNode;
- ASTlistHead.blink=newNode;
- newNode->blink=prev;
- newNode->flink=&ASTlistHead;
- newNode->serialNumber=ASTserialNumber;
- ASTserialNumber++;
- return;
- }
-
- /* Stop tracking an AST */
-
- void zzastdelete_userhook (AST *singleNode) {
-
- AST *fnode;
- AST *bnode;
-
- if (singleNode!=0) {
- fnode=singleNode->flink;
- bnode=singleNode->blink;
- fnode->blink=bnode;
- bnode->flink=fnode;
- singleNode->serialNumber=0;
- singleNode->flink=0;
- singleNode->blink=0;
- };
- return;
- }
-
- /* Print ASTs that are still on list */
-
- void print_lost_ast () {
-
- AST *node;
-
- for (node=ASTlistHead.flink;
- node!=0 && node!= &ASTlistHead;
- node=node->flink) {
- printf ("**** Start of lost AST listing **** %d\n",node->serialNumber);
- treewalk (node); /* user supplied routine */
- printf ("\n**** End of lost AST listing ****\n");
- };
- }
-
- Page 70
- -------------------------------------------------------------------------------
- These routines print out the AST tree. This will be application dependent.
- -------------------------------------------------------------------------------
- #include "stdpccts.h"
- #include "stdlib.h"
-
- static int treenest=0;
-
- void treeindent(int nesting) {
- int i;
- for (i=0;i<nesting*2;i++) {
- printf (" ");
- };
- return;
- }
-
- void treewalk1 (AST *tree) {
- while (tree != NULL) {
- treeindent(treenest);
- printf ("%s",zztokens[tree->token]);
- if (tree->text != NULL) {
- printf (" %s",tree->text);
- };
- printf ("\n");
- treenest++;
- treewalk1 (tree->down);
- treenest--;
- tree=tree->right;
- };
- return;
- }
-
- void treewalk (AST *tree) {
- treenest=0;
- treewalk1(tree);
- return;
- }
-
- Page 71
- ===============================================================================
- Example 7: Difference Between Various Types of Lookahead in Antlr/DLG
- ===============================================================================
- The following grammar with k=1 and standard lookahead is meant to show how
- zzlextext and LATEXT(i) differ for the case k=1 and k=3 (see later
- examples).
-
- The use of LA(1) and LATEXT(1) in semantic predicates is OK, but their use
- in actions is NOT recommended because, as the examples below show, there is
- a variation in what LATEXT(1) means when it appears in an action.
-
- Use attributes to refer to tokens already encountered.
- -------------------------------------------------------------------------------
- #header <<
-
- #include "charbuf.h"
-
- #define ZZCOL
-
- >>
-
- <<
-
- /* Can't put quoted strings in semantic predicates in version 1.23 */
-
- #define Semantic_Predicate_Of_1 "Semantic Predicate Of 1"
-
- int AntlrCount=0;
-
- int main() {
- again: ANTLR (statement(),stdin);
- return 0;
- }
-
- #define LANL(i) (*LATEXT(i) == '\n' ? "NL" : LATEXT(i))
- #define LAINFNL(i) (*ZZINF_LATEXT(i) == '\n' ? "NL" : ZZINF_LATEXT(i))
-
- void laDump(char * label) {
- AntlrCount++;
- printf ("\tRecognized: %s (AntlrCount=%d)\n",label,AntlrCount);
- printf ("\tValue of zzbegcol: %d\n",zzbegcol);
- printf ("\tLATEXT(0..1)={%s,%s}\n",
- LANL(0),LANL(1));
- printf ("\tzzlextext=%s\n",(zzlextext[0]=='\n' ? "NL" : zzlextext) );
- #ifdef ZZINF_LOOK
- printf ("\tZZINF_LATEXT(0..1)={%s,%s}\n",
- LAINFNL(0),LAINFNL(1));
- #endif
- return;
- }
-
- >>
-
- #lexaction <<
-
- int DLGcount=0;
-
- >>
-
- Page 72
- #token ID "[a-z]*"
- <<DLGcount++;printf("DLGcount: %d Col %d ID=(%s)\n",
- DLGcount,zzbegcol,zzlextext);>>
- #token WS "[\ \t]*"
- <<DLGcount++;printf("DLGcount: %d Col %d WS\n",DLGcount,zzbegcol);zzskip();>>
- #token NL "\n"
- <<DLGcount++;printf("DLGcount: %d Col %d NL\n",
- DLGcount,zzbegcol);
- zzendcol=0;
- zzline++;>>
-
- statement : (formats) * "@"
-
- ;formats
- : format1
- | format2
-
- ;format1 :
- <<(laDump(Semantic_Predicate_Of_1),1)>>?
- ID
- <<laDump("After first ID of 1");>>
- ID
- <<laDump("After second ID of 1");>>
- ID
- <<laDump("After third ID of 1");>>
- NL
- <<laDump("-> Format 1 After: ID ID ID NL");>>
- ;format2 :
- ID ID ID;
- -------------------------------------------------------------------------------
- The input data file:
- -------------------------------------------------------------------------------
- a b c
- d e f
-
- Page 73
- -------------------------------------------------------------------------------
- The output from the standard and the interactive parsers was identical in
- this case.
- -------------------------------------------------------------------------------
- DLGcount: 1 Col 1 ID=(a)
- Recognized: Semantic Predicate Of 1 (AntlrCount=1)
- Value of zzbegcol: 1
- LATEXT(0..1)={a,a}
- zzlextext=a
- Recognized: Semantic Predicate Of 1 (AntlrCount=2)
- Value of zzbegcol: 1
- LATEXT(0..1)={a,a}
- zzlextext=a
- Recognized: Semantic Predicate Of 1 (AntlrCount=3)
- Value of zzbegcol: 1
- LATEXT(0..1)={a,a}
- zzlextext=a
- Recognized: After first ID of 1 (AntlrCount=4)
- Value of zzbegcol: 1
- LATEXT(0..1)={a,a}
- zzlextext=a
- DLGcount: 2 Col 2 WS
- DLGcount: 3 Col 3 ID=(b)
- Recognized: After second ID of 1 (AntlrCount=5)
- Value of zzbegcol: 3
- LATEXT(0..1)={b,b}
- zzlextext=b
- DLGcount: 4 Col 4 WS
- DLGcount: 5 Col 5 ID=(c)
- Recognized: After third ID of 1 (AntlrCount=6)
- Value of zzbegcol: 5
- LATEXT(0..1)={c,c}
- zzlextext=c
- DLGcount: 6 Col 6 NL
- Recognized: -> Format 1 After: ID ID ID NL (AntlrCount=7)
- Value of zzbegcol: 6
- LATEXT(0..1)={NL,NL}
- zzlextext=NL
- DLGcount: 7 Col 1 ID=(d)
- Recognized: Semantic Predicate Of 1 (AntlrCount=8)
- Value of zzbegcol: 1
- LATEXT(0..1)={d,d}
- zzlextext=d
- Recognized: Semantic Predicate Of 1 (AntlrCount=9)
- Value of zzbegcol: 1
- LATEXT(0..1)={d,d}
- zzlextext=d
- Recognized: Semantic Predicate Of 1 (AntlrCount=10)
- Value of zzbegcol: 1
- LATEXT(0..1)={d,d}
- zzlextext=d
- Recognized: After first ID of 1 (AntlrCount=11)
- Value of zzbegcol: 1
- LATEXT(0..1)={d,d}
- zzlextext=d
- DLGcount: 8 Col 2 WS
- <remaining output omitted>
-
- Page 74
- -------------------------------------------------------------------------------
- The same grammar and input file when compiled with -DZZINF_LOOK
- -------------------------------------------------------------------------------
- DLGcount: 1 Col 1 ID=(a)
- DLGcount: 2 Col 2 WS
- DLGcount: 3 Col 3 ID=(b)
- DLGcount: 4 Col 4 WS
- DLGcount: 5 Col 5 ID=(c)
- DLGcount: 6 Col 6 NL
- DLGcount: 7 Col 1 ID=(d)
- DLGcount: 8 Col 2 WS
- DLGcount: 9 Col 3 ID=(e)
- DLGcount: 10 Col 4 WS
- DLGcount: 11 Col 5 ID=(f)
- DLGcount: 12 Col 6 NL
- Recognized: Semantic Predicate Of 1 (AntlrCount=1)
- Value of zzbegcol: 1
- LATEXT(0..1)={a,a}
- zzlextext=a
- ZZINF_LATEXT(0..1)={a,b}
- Recognized: Semantic Predicate Of 1 (AntlrCount=2)
- Value of zzbegcol: 1
- LATEXT(0..1)={a,a}
- zzlextext=a
- ZZINF_LATEXT(0..1)={a,b}
- Recognized: Semantic Predicate Of 1 (AntlrCount=3)
- Value of zzbegcol: 1
- LATEXT(0..1)={a,a}
- zzlextext=a
- ZZINF_LATEXT(0..1)={a,b}
- Recognized: After first ID of 1 (AntlrCount=4)
- Value of zzbegcol: 1
- LATEXT(0..1)={a,a}
- zzlextext=a
- ZZINF_LATEXT(0..1)={a,b}
- Recognized: After second ID of 1 (AntlrCount=5)
- Value of zzbegcol: 1
- LATEXT(0..1)={b,b}
- zzlextext=b
- ZZINF_LATEXT(0..1)={b,c}
- Recognized: After third ID of 1 (AntlrCount=6)
- Value of zzbegcol: 1
- LATEXT(0..1)={c,c}
- zzlextext=c
- ZZINF_LATEXT(0..1)={c,NL}
- Recognized: -> Format 1 After: ID ID ID NL (AntlrCount=7)
- Value of zzbegcol: 1
- LATEXT(0..1)={NL,NL}
- zzlextext=NL
- ZZINF_LATEXT(0..1)={NL,d}
- Recognized: Semantic Predicate Of 1 (AntlrCount=8)
- Value of zzbegcol: 1
- LATEXT(0..1)={d,d}
- zzlextext=d
- ZZINF_LATEXT(0..1)={d,e}
- <remaining output omitted>
-
- Page 75
- -------------------------------------------------------------------------------
- The following grammar with k=3 is meant to show aspects of lookahead choices.
- -------------------------------------------------------------------------------
- #header <<
-
- #include "charbuf.h"
-
- #define ZZCOL
-
- >>
-
- <<
-
- /* Can't put quoted strings in semantic predicates in version 1.23 */
-
- #define Semantic_Predicate_Of_1 "Semantic Predicate Of 1"
-
- int AntlrCount=0;
-
- int main() {
- again: ANTLR (statement(),stdin);
- return 0;
- }
-
- #define LANL(i) (*LATEXT(i) == '\n' ? "NL" : LATEXT(i))
- #define LAINFNL(i) (*ZZINF_LATEXT(i) == '\n' ? "NL" : ZZINF_LATEXT(i))
-
- void laDump(char * label) {
- AntlrCount++;
- printf ("\tRecognized: %s (AntlrCount=%d)\n",label,AntlrCount);
- printf ("\tValue of zzbegcol: %d\n",zzbegcol);
- printf ("\tLATEXT(0..3)={%s,%s,%s,%s}\n",
- LANL(0),LANL(1),LANL(2),LANL(3));
- printf ("\tzzlextext=%s\n",(zzlextext[0]=='\n' ? "NL" : zzlextext) );
- #ifdef ZZINF_LOOK
- printf ("\tZZINF_LATEXT(0..3)={%s,%s,%s,%s}\n",
- LAINFNL(0),LAINFNL(1),LAINFNL(2),LAINFNL(3));
- #endif
- return;
- }
-
- >>
-
- #lexaction <<
-
- int DLGcount=0;
-
- >>
-
- #token ID "[a-z]*"
- <<DLGcount++;printf("DLGcount: %d Col %d ID=(%s)\n",
- DLGcount,zzbegcol,zzlextext);>>
- #token WS "[\ \t]*"
- <<DLGcount++;printf("DLGcount: %d Col %d WS\n",DLGcount,zzbegcol);zzskip();>>
- #token NL "\n"
- <<DLGcount++;printf("DLGcount: %d Col %d NL\n",
- DLGcount,zzbegcol);
- zzendcol=0;
- zzline++;>>
-
- Page 76
- statement : (formats) * "@"
-
- ;formats
- : format1
- | format2
-
- ;format1 :
- <<(laDump(Semantic_Predicate_Of_1),1)>>?
- ID
- <<laDump("After first ID of 1");>>
- ID
- <<laDump("After second ID of 1");>>
- ID
- <<laDump("After third ID of 1");>>
- NL
- <<laDump("-> Format 1 After: ID ID ID NL");>>
- ;format2 :
- ID ID ID NL <<laDump("-> Format 2: ID ID ID");>>
- ;
- -------------------------------------------------------------------------------
- The input data file:
- -------------------------------------------------------------------------------
- a b c
- d e f
-
- Page 77
- -------------------------------------------------------------------------------
- When built with version 1.23 and "standard" options: AFLAGS = -k 3
- -------------------------------------------------------------------------------
- DLGcount: 1 Col 1 ID=(a)
- DLGcount: 2 Col 2 WS
- DLGcount: 3 Col 3 ID=(b)
- DLGcount: 4 Col 4 WS
- DLGcount: 5 Col 5 ID=(c)
- DLGcount: 6 Col 6 NL
- Recognized: Semantic Predicate Of 1 (AntlrCount=1)
- Value of zzbegcol: 6
- LATEXT(0..3)={NL,a,b,c}
- zzlextext=a
- Recognized: Semantic Predicate Of 1 (AntlrCount=2)
- Value of zzbegcol: 6
- LATEXT(0..3)={NL,a,b,c}
- zzlextext=a
- Recognized: Semantic Predicate Of 1 (AntlrCount=3)
- Value of zzbegcol: 6
- LATEXT(0..3)={NL,a,b,c}
- zzlextext=a
- Recognized: After first ID of 1 (AntlrCount=4)
- Value of zzbegcol: 6
- LATEXT(0..3)={NL,a,b,c}
- zzlextext=a
- DLGcount: 7 Col 1 ID=(d)
- Recognized: After second ID of 1 (AntlrCount=5)
- Value of zzbegcol: 1
- LATEXT(0..3)={d,b,c,NL}
- zzlextext=b
- DLGcount: 8 Col 2 WS
- DLGcount: 9 Col 3 ID=(e)
- Recognized: After third ID of 1 (AntlrCount=6)
- Value of zzbegcol: 3
- LATEXT(0..3)={e,c,NL,d}
- zzlextext=c
- DLGcount: 10 Col 4 WS
- DLGcount: 11 Col 5 ID=(f)
- Recognized: -> Format 1 After: ID ID ID NL (AntlrCount=7)
- Value of zzbegcol: 5
- LATEXT(0..3)={f,NL,d,e}
- zzlextext=NL
- DLGcount: 12 Col 6 NL
- Recognized: Semantic Predicate Of 1 (AntlrCount=8)
- Value of zzbegcol: 6
- LATEXT(0..3)={NL,d,e,f}
- zzlextext=d
- Recognized: Semantic Predicate Of 1 (AntlrCount=9)
- Value of zzbegcol: 6
- LATEXT(0..3)={NL,d,e,f}
- zzlextext=d
- Recognized: Semantic Predicate Of 1 (AntlrCount=10)
- Value of zzbegcol: 6
- LATEXT(0..3)={NL,d,e,f}
- zzlextext=d
- Recognized: After first ID of 1 (AntlrCount=11)
- Value of zzbegcol: 6
- LATEXT(0..3)={NL,d,e,f}
- zzlextext=d
- <remaining output omitted>
-
- Page 78
- -------------------------------------------------------------------------------
- When built with version 1.23 and "interactive" options: AFLAGS = -k 3 -gk
- -------------------------------------------------------------------------------
- DLGcount: 1 Col 1 ID=(a)
- Recognized: Semantic Predicate Of 1 (AntlrCount=1)
- Value of zzbegcol: 1
- LATEXT(0..3)={,a,,}
- zzlextext=
- DLGcount: 2 Col 2 WS
- DLGcount: 3 Col 3 ID=(b)
- DLGcount: 4 Col 4 WS
- DLGcount: 5 Col 5 ID=(c)
- Recognized: Semantic Predicate Of 1 (AntlrCount=2)
- Value of zzbegcol: 5
- LATEXT(0..3)={,a,b,c}
- zzlextext=
- Recognized: Semantic Predicate Of 1 (AntlrCount=3)
- Value of zzbegcol: 5
- LATEXT(0..3)={,a,b,c}
- zzlextext=
- Recognized: After first ID of 1 (AntlrCount=4)
- Value of zzbegcol: 5
- LATEXT(0..3)={a,b,c,}
- zzlextext=
- Recognized: After second ID of 1 (AntlrCount=5)
- Value of zzbegcol: 5
- LATEXT(0..3)={b,c,,a}
- zzlextext=
- Recognized: After third ID of 1 (AntlrCount=6)
- Value of zzbegcol: 5
- LATEXT(0..3)={c,,a,b}
- zzlextext=
- DLGcount: 6 Col 6 NL
- Recognized: -> Format 1 After: ID ID ID NL (AntlrCount=7)
- Value of zzbegcol: 6
- LATEXT(0..3)={NL,a,b,c}
- zzlextext=a
- DLGcount: 7 Col 1 ID=(d)
- Recognized: Semantic Predicate Of 1 (AntlrCount=8)
- Value of zzbegcol: 1
- LATEXT(0..3)={NL,d,b,c}
- zzlextext=b
- DLGcount: 8 Col 2 WS
- DLGcount: 9 Col 3 ID=(e)
- DLGcount: 10 Col 4 WS
- DLGcount: 11 Col 5 ID=(f)
- Recognized: Semantic Predicate Of 1 (AntlrCount=9)
- Value of zzbegcol: 5
- LATEXT(0..3)={NL,d,e,f}
- zzlextext=NL
- Recognized: Semantic Predicate Of 1 (AntlrCount=10)
- Value of zzbegcol: 5
- LATEXT(0..3)={NL,d,e,f}
- zzlextext=NL
- Recognized: After first ID of 1 (AntlrCount=11)
- Value of zzbegcol: 5
- LATEXT(0..3)={d,e,f,NL}
- zzlextext=NL
- <remaining output omitted>
-
- Page 79
- -------------------------------------------------------------------------------
- When built with version 1.23 and infinite lookahead options:
- AFLAGS = -k 3
- CFLAGS = -DZZINF_LOOK
- -------------------------------------------------------------------------------
- DLGcount: 1 Col 1 ID=(a)
- DLGcount: 2 Col 2 WS
- DLGcount: 3 Col 3 ID=(b)
- DLGcount: 4 Col 4 WS
- DLGcount: 5 Col 5 ID=(c)
- DLGcount: 6 Col 6 NL
- DLGcount: 7 Col 1 ID=(d)
- DLGcount: 8 Col 2 WS
- DLGcount: 9 Col 3 ID=(e)
- DLGcount: 10 Col 4 WS
- DLGcount: 11 Col 5 ID=(f)
- DLGcount: 12 Col 6 NL
- Recognized: Semantic Predicate Of 1 (AntlrCount=1)
- Value of zzbegcol: 1
- LATEXT(0..3)={NL,a,b,c}
- zzlextext=a
- ZZINF_LATEXT(0..3)={a,b,c,NL}
- Recognized: Semantic Predicate Of 1 (AntlrCount=2)
- Value of zzbegcol: 1
- LATEXT(0..3)={NL,a,b,c}
- zzlextext=a
- ZZINF_LATEXT(0..3)={a,b,c,NL}
- Recognized: Semantic Predicate Of 1 (AntlrCount=3)
- Value of zzbegcol: 1
- LATEXT(0..3)={NL,a,b,c}
- zzlextext=a
- ZZINF_LATEXT(0..3)={a,b,c,NL}
- Recognized: After first ID of 1 (AntlrCount=4)
- Value of zzbegcol: 1
- LATEXT(0..3)={NL,a,b,c}
- zzlextext=a
- ZZINF_LATEXT(0..3)={a,b,c,NL}
- Recognized: After second ID of 1 (AntlrCount=5)
- Value of zzbegcol: 1
- LATEXT(0..3)={d,b,c,NL}
- zzlextext=b
- ZZINF_LATEXT(0..3)={b,c,NL,d}
- Recognized: After third ID of 1 (AntlrCount=6)
- Value of zzbegcol: 1
- LATEXT(0..3)={e,c,NL,d}
- zzlextext=c
- ZZINF_LATEXT(0..3)={c,NL,d,e}
- Recognized: -> Format 1 After: ID ID ID NL (AntlrCount=7)
- Value of zzbegcol: 1
- LATEXT(0..3)={f,NL,d,e}
- zzlextext=NL
- ZZINF_LATEXT(0..3)={NL,d,e,f}
- Recognized: Semantic Predicate Of 1 (AntlrCount=8)
- Value of zzbegcol: 1
- LATEXT(0..3)={NL,d,e,f}
- zzlextext=d
- ZZINF_LATEXT(0..3)={d,e,f,NL}
- <remaining output omitted>
-
- Page 80
- ===============================================================================
- Example 8: Preserving whitespace during lexing
- ===============================================================================
- The following program passes whitespace through DLG to the parser by
- combining the whitespace with the token which follows it. It is up to the
- user to determine how to handle the leading whitespace during attribute
- and AST creation.
-
- In this example whitespace ("#token WS") includes only the space character:
- it does not include tab or newline. Maintaining accurate column
- information when using zzmore() requires some extra work (as mentioned
- in a note in the section on lexical issues).
-
- The routines in "charbuf.h" assume that tokens are no longer than
- "D_TextSize" characters. The value can be changed from its default value
- of 30 by "#define D_TextSize ..." in the #header prior to the #include of
- "charbuf.h".
-
- It was built with k=1.
- -------------------------------------------------------------------------------
- #header <<
-
- #include "charbuf.h"
-
- #define ZZCOL
-
- >>
-
- <<
-
- int AntlrCount=0;
-
- int main() {
- again: ANTLR (statement(),stdin);
- return 0;
- }
-
- static char xlateBuf[100];
-
- char * xlate (char * s) {
- char * p=s;
- char * q=xlateBuf;
- if (*p == 0) {
- *q='@';q++;
- };
- while (*p != 0) {
- if (*p == ' ') {
- *q='-';q++;
- } else if (*p == '\t') {
- *q='\\';q++;*q='t';q++;
- } else if (*p == '\n') {
- *q='\\';q++;*q='n';q++;
- } else {
- *q=*p;q++;
- };
- p++;
- };
- *q=0;
- return (xlateBuf);
- }
-
- Page 81
- void laDump(char * label) {
- AntlrCount++;
- printf ("\tRecognized: %s (AntlrCount=%d) ",label,AntlrCount);
- printf ("zzlextext=(%s)\n",xlate(zzlextext));
- return;
- }
-
- >>
-
- #lexaction <<
-
- int DLGcount=0;
-
- >>
-
- #token ID "[a-z]*"
- <<DLGcount++;printf("DLGcount: %d Col %d ID=(%s)\n",
- DLGcount,zzbegcol,zzlextext);>>
- #token WS "[\ ]*"
- <<DLGcount++;printf("DLGcount: %d Col %d WS\n",DLGcount,zzbegcol);
- zzmore();>>
- #token NL "\n"
- <<DLGcount++;printf("DLGcount: %d Col %d NL\n",
- DLGcount,zzbegcol);
- zzendcol=0;
- zzline++;
- >>
-
- statement : (line) * "@"
-
- ;line : (ID <<laDump("ID");>> ) * NL
-
- ;
-
- Page 82
- ===============================================================================
- Example 9: Passing column information through DLG using a kludge
- ===============================================================================
- The following demonstrates a kludge which allows one to pass column
- information through DLG for use with attributes (or ASTs) even when
- using lookahead with LL_K>1 or infinite lookahead
- mode. This technique is probably not necessary in C++ mode.
- -------------------------------------------------------------------------------
- #header <<
-
- #include "col_charbuf.h"
-
- #define ZZCOL
-
- #include "shiftr.h"
-
- #define COL_BITS_PER_BYTE 6
- #define COL_BITS_MASK ( (1 << COL_BITS_PER_BYTE) - 1 )
-
- >>
-
- <<
-
- int main() {
- again: ANTLR (statement(),stdin);
- return 0;
- }
-
- void create_attr (Attrib *a,int tok,char *t) {
- char * p;
- char * q;
- int i=0;
-
- a->col=0;
-
- for (p=t;*p != '\001' && *p != 0;p++) {
- if (i < D_TextSize-1) {
- a->text[i]=*p;
- i++;
- };
- };
-
- a->text[i]=0;
-
- if (*p == '\001') {
- a->col=(p[1] & COL_BITS_MASK) +
- ( (p[2] & COL_BITS_MASK) << COL_BITS_PER_BYTE );
- };
-
- printf ("create_attr: Col %d text=(%s)\n",a->col,a->text);
- return;
- }
-
- >>
-
- Page 83
- #lexaction <<
-
- int DLGcount=0;
- char encodedCol[5];
-
- void record() {
- encodedCol[0]='\001';
- encodedCol[1]=zzbegcol & COL_BITS_MASK;
- encodedCol[2]=(zzbegcol SHIFTR COL_BITS_PER_BYTE) & COL_BITS_MASK;
- encodedCol[3]=0;
- /***
- **** if (strlen(zzlextext) > ZZLEXBUFSIZE - sizeof(encodedCol) ) {...}
- ***/
- strcat(zzlextext,encodedCol);
- return;
- }
- >>
-
- #token ID "[a-z A-Z 0-9]*"
- <<DLGcount++;printf("DLGcount: %d Col %d ID=(%s)\n",
- DLGcount,zzbegcol,zzlextext);record();>>
- #token WS "[\ \t]*"
- <<DLGcount++;printf("DLGcount: %d Col %d WS\n",
- DLGcount,zzbegcol);zzskip();>>
- #token NL "\n"
- <<DLGcount++;printf("DLGcount: %d Col %d NL\n",
- DLGcount,zzbegcol);
- zzendcol=0;
- zzline++;
- zzskip();>>
-
- statement : (formats) * "@" ;
- formats : ( ID ) * NL ;
- -------------------------------------------------------------------------------
- File: col_charbuf.h
- -------------------------------------------------------------------------------
- #ifndef ZZCHARBUF_H
- #define ZZCHARBUF_H
-
- #include <string.h>
-
- #ifndef D_TextSize
- #define D_TextSize 30
-
- #endif
-
- typedef struct {
- char text[D_TextSize];
- int col;
- } Attrib;
-
- void create_attr(Attrib *a,int tok,char *t);
-
- #define zzcr_attr(a,tok,t) create_attr(a,tok,t)
-
- #endif
- -------------------------------------------------------------------------------
- File: shiftr.h
- -------------------------------------------------------------------------------
- #ifndef SHIFTR
- #define SHIFTR >>
- #endif
-
- Page 84
- ===============================================================================
- Example 10: Use of #lexclass
- ===============================================================================
- The user has a grammar in which an asterisk ("*") is normally used to indicate
- multiplication. However, if "*" is the first token appearing in a statement
- then it indicates a comment. Comments are terminated by a newline. Statements
- are separated by semi-colons (";"). How does one use #lexclass to separate
- the different lexical analyses required for comments and arithmetic
- statements?
-
- For this example the recognized tokens have been reduced to identifiers and "*".
-
- This code requires many #token actions to have the statement:
-
- foundToken=1;
-
- If this is inconvenient the user can modify dlgauto.h as outlined in
- "Section on ANTLR/DLG Internals" to call a user-supplied routine (defined
- inside the #lexaction) just after each call to the #token action routine.
- -------------------------------------------------------------------------------
- #header <<
-
- #include "charbuf.h"
-
- >>
-
- <<
- int main() {
- again: ANTLR (program(),stdin);
- return 0;
- }
- >>
-
- #lexaction <<
-
- int foundToken=0;
-
- >>
-
- #lexclass START
-
- #token ID "[a-z A-Z]*" <<foundToken=1;>>
- #token SC ";" <<foundToken=0;>>
- #token WS "[\ \t]*" <<zzskip();>>
- #token NL "\n" <<zzskip();>>
- #token STAR "\*" <<if (foundToken == 0) {
- zzmode(LC_COMMENT);
- zzmore();};
- >>
- #lexclass LC_COMMENT
- #token COMMENT "~[\n]*" <<foundToken=0;
- zzmode(START);
- >>
- program : (statement) * "@"
-
- ;statement
- : COMMENT <<printf ("comment: %s\n",$1.text);>>
- | (ID | STAR ) * SC <<printf ("semi-colon\n");>>
- ;
-
- Page 85
- ===============================================================================
- Example 11: Use of zzchar and #lexclass
- ===============================================================================
- Consider the problem of distinguishing floating point numbers from
- range expressions such as those used in Pascal:
-
- range: 1..23
- range: a..z
- float: 1.23
-
- As a first effort one might try:
-
- #token ID "[a-z]*"
- #token Int "[0-9]*"
- #token Range ".."
- #token Float "[0-9]*.[0-9]*"
-
- The problem is that "1..23" looks like the floating point number "1." with
- an illegal "." at the end. DLG always takes the longest matching string,
- so "1." will always look more appetizing than "1". What one needs to do
- is to look at the character following "1." to see if it is another ".",
- and if it is to assume that it is a range expression. The flex lexer has
- trailing context, but DLG doesn't - except for the single character in
- zzchar.
-
- A solution in DLG is to write the #token Float action routine to look
- at what's been accepted and at zzchar in order to decide what to do:
- ------------------------------------------------------------------------
- #header <<#include "int.h">>
-
- #token Range ".."
- #token Int "[0-9]*"
- #token Float "[0-9]*.[0-9]*"
- <<if (*zzendexpr == '.' && /* might use more complex test */
- zzchar == '.') {
- NLA=Int;
- zzmode(LC_Range);
- };
- >>
- #token WS "\ " <<zzskip();>>
- #token NL "\n" <<zzskip();>>
-
- #lexclass LC_Range
-
- // consume second "." of range token ("..") and return to normal mode
-
- #token Range "." <<zzmode(START);>>
-
- << int main() {
- ANTLR (rule(),stdin);
- }
- >>
- rule: ( Range <<printf ("range\n");>>
- | Int <<printf ("int\n");>>
- | Float <<printf ("float\n");>>
- )*
- ;
-
- Page 86
- ===============================================================================
- Example 12: Rewriting a grammar so it can be handled by Antlr
- ===============================================================================
- The original grammar was in this form:
-
- command := SET var BECOMES expr
- | SET var BECOMES QUOTE QUOTE
- | SET var BECOMES QUOTE expr QUOTE
- | SET var BECOMES QUOTE command QUOTE
-
- expr := QUOTE anyCharButQuote QUOTE
- | expr AddOp expr
- | expr MulOp expr
-
- The repetition of "SET var BECOMES" for command would require k=4 to
- get to the interesting part. The first step is to left-factor command:
-
- command := SET var BECOMES
- ( expr
- | QUOTE QUOTE
- | QUOTE expr QUOTE
- | QUOTE command QUOTE
- )
-
- The definition of expr uses left recursion which must be eliminated
- when using Antlr:
-
- op := AddOp
- | MulOp
-
- expr := QUOTE anyCharButQuote QUOTE (op expr)*
-
- Since expr begins with QUOTE and all the alternatives of the sub-rule
- of command also start with QUOTE this too can be left-factored:
-
- command := SET var BECOMES QUOTE
- ( expr_suffix
- | QUOTE
- | expr QUOTE
- | command QUOTE
- )
-
- expr_suffix := anyCharButQuote QUOTE (op expr)*
- expr := QUOTE expr_suffix
-
- The final grammar can be built by Antlr with k=2.
-
- Page 87
- #header <<#include "charbuf.h">>
-
- <<
- int main() {
- ANTLR(repeat(),stdin);
- return 0;
- }
- >>
- #token Q "\""
- #token SVB "svb"
- #token Qbar "[a-z A-Z]*"
- #token AddOp "\+"
- #token MulOp "\*"
- #token WS "\ " <<zzskip();>>
- #token NL "\n" <<zzskip();>>
-
- repeat : ( command )+ "@";
- command : SVB Q ( expr_suffix
- | expr Q
- | Q <<printf("null command\n");>>
-
- | command Q <<printf("command\n");>>
- );
-
- expr_suffix : Qbar Q <<printf("The Qbar expr is (%s)\n",$1.text);>>
- { op expr };
- expr : Q expr_suffix;
- op : AddOp | MulOp ;
- -------------------------------------------------------------------------------
-